Downloading  music  from  the  Internet  is  not  illegal.  Plenty  of  music  available  online  is  not 
just  free  but  also  easily  available,  legal  and  -  most  important  -  worth  hearing. 

That  fact  may  come  as  a  surprise  after  highly  publicized  lawsuits  by  the  Recording 
Industry  Association  of  America,  representing  major  labels,  against  fans  using  peer-to-peer 
programs  like  Grokster  and  EDonkey  to  collect  music  on  the  Web. 

But  the  fine  print  of  those  lawsuits  makes  clear  that  fans  are  being  sued  not  for 
downloading  but  for  unauthorized  distribution:  leaving  music  in  a  shared  folder  for  other 
peer-to-peer  users  to  take.  As  copyright  holders,  the  labels  have  the  exclusive  legal  right 
to  distribute  the  music  recorded  for  them,  even  if  technology  now  makes  that  right  nearly 
impossible  to  enforce. 

While  the  recording  business  litigates  and  lobbies  over  music  being  given  away  online, 
countless  musicians  are  taking  advantage  of  the  Internet  to  get  their  music  heard.  They 
are  betting  that  if  they  give  away  a  song  or  two,  they  will  build  audiences,  promote  live 
shows  and  sell  more  recordings. 

The  first  place  to  look  for  free  music  online  is  at  the  musicians'  own  sites.  Many 
performers,  from  Bob  Dylan  (www.bobdylan.com)  to  the  Yeah  Yeah  Yeahs 
(www.yeahyeahyeahs.com),  post  hard-to-find  songs  for  listening:  some  as  free  downloads, 
some  as  streaming  audio  (which  can  be  recorded  with  a  free  program  like  StepVoice  at 
www.stepvoice.com). 

A  next  place  to  look  is  the  labels,  particularly  independent  rock  and  electronic  labels  like 
Matador  (www.matadorrecords.com/music/mp3s.html),  Vagrant 
(www.vagrant.com/vagrant/audio/audio.jsp),  Barsuk  (www.barsuk.com),  Saddle  Creek 
(www.saddle-creek.com)  or  Tigerbeat6  (www.tigerbeat6.com/html/catalogue.htm). 

Many  public  U.S.  radio  stations  also  maintain  music  archives  for  streaming  or  downloading. 
Among  them  are  the  classical-music  station  WNYC  (www.wnyc.org)  and  eclectic  stations 
like  WFMU  in  Jersey  City  (www.wfmu.org)  and  KCRW  in  Santa  Monica,  California 
(www.kcrw.org),  all  of  which  have  troves  of  live  performances.  MTV  (at  www.mtv.com) 
presents  an  entire  album  each  week  as  an  audio  stream. 

Following  is  a  selection  of  other  sites  offering  free  music  online.  Most  of  them  are  best 
used  with  either  a  broadband  connection  or  infinite  patience.  While  major-label  recordings 
are  largely  (but  not  entirely)  off  limits,  there  is  more  than  enough  available  music  to  satisfy 
every  listener. 

Epitonic:  The  first  and  best  place  to  look  for  any  band  with  an  independent  recording  is 
www.epitonic.com,  a  superbly  organized  site  that  is  likely  to  have  music  from  nearly 
everyone  heard  on  college  radio.  It  includes  not  only  downloadable  songs  but  also 
biographical  information  and  links  for  hundreds  of  acts,  grouped  under  genres  and 
subgenres. 

And  it  has  an  invaluable  "Similar  Artists"  feature  that  can  direct  fans  of  one  band  to 
dozens  of  potential  new  favorites.  Within  Epitonic's  huge  roster  is  at  least  a  song  or  two 
from  some  major-label  acts,  among  them  the  New  York  band  Secret  Machines,  the  Texas 
band  Sparta  and  the  English  bands  Radiohead  and  Spiritualized.  But  independent  bands 
like  Bright  Eyes  or  Godspeed  You  Black  Emperor  are  every  bit  as  good. 

Webjay:  At  www.webjay.org,  music  fans  share  their  Web  finds  with  the  world.  There  is  no 
music  on  the  site,  just  lists  of  links  that  allow  users  either  to  play  entire  lists  or  to 
download  items  directly  one  by  one.  Webjay  is  something  like  the  lists  submitted  by 
customers  at  www.amazon.com,  but  with  connections  to  the  music  itself.  As  such,  it's  only 
as  good  as  the  widely  varied  skills  of  its  contributors,  and  its  links  are  not  always 
dependable. 

Furthurnet:  At  www.furthurnet.com,  this  is  a  peer-to-peer  network  that  trades  only 
recordings  of  bands  that  encourage  listeners  to  record  concerts:  not  just  the  Grateful 
Dead  but  Phish,  Gov't  Mule,  Dave  Matthews  Band,  Los  Lobos,  Wilco  and  David  Byrne  as 
well.  Users  need  to  install  a  program  available  on  the  Web  site.  Most  of  the  available 
concert  recordings  don't  use  MP3  files,  but  a  better  quality  audio  format,  SHN,  which  also 
requires  some  software  installation.  It's  easy;  information  on  the  site  explains  all  the 
technicalities. 

Another  connection  for  jam  bands  is  www.etree.org,  which  points  listeners  toward 
recordings  stored  online  and  is  equally  fastidious  about  high  fidelity.  Meanwhile,  concert 
recordings  of  all  sorts,  from  vintage  1960s  bootlegs  to  music  only  a  few  days  old,  have 
been  traded  at  www.sharingthegroove.org,  although  the  site  is  currently  undergoing 
maintenance. 

The  Library  of  Congress:  Through  the  years,  tax  dollars  have  supported  researchers  like 
Alan  Lomax  on  excursions  to  collect  music  from  every  nook  and  cranny  and  tradition  they 
could  discover  across  the  United  States.  The  Library  of  Congress  has  made  a 
considerable  amount  available  free  online.  A  place  to  start  is  the  American  Memory 
Collection  (memory.loc.gov/ammem/audio.html),  with  fiddle  tunes,  American  Indian  music, 
border  music  from  the  Rio  Grande,  Dust  Bowl  songs  and  more. 

Folkways  Records:  In  1987,  the  Smithsonian  Institution  bought  the  catalog  of  Folkways 
Records,  which  had  set  out  to  document  every  sound  in  the  world  and  continues  to 
support  projects  like  a  20-disc  collection  of  Indonesian  music.  Many  of  the  Folkways 
recordings  can  be  heard  on  the  Web  at  www.folkways.si.edu. 

Internet  Archive:  The  Internet  Archive  (www.archive.org)  includes  a  Live  Music  Archive 
with  more  than  10,000  concerts  via  etree.org.  Most  are  from  jam  bands,  but  there  is  plenty 
to  choose  from.  The  archive  also  includes  an  assortment  of  other  audio  under  All 
Collections,  which  has  131  songs  from  78-rpm  discs,  and  more  than  3,000  songs  on  what 
it  calls  net  labels,  most  of  them  releasing  electronic  music. 

IUMA:  The  Internet  Underground  Music  Archive  (www.iuma.org)  was  a  pioneer  of  free 
Internet  music.  It  was  founded  in  1993  as  a  place  for  musicians  to  post  their  own  music 
online,  and  it  just  keeps  on  expanding.  Unfortunately,  it  is  both  overwhelming  and 
overwhelmed;  finding  a  good  song  requires  extraordinary  luck,  and  downloading  it  will  take 
a  while.  Like  the  other  send-it-yourself  sites  noted  here,  IUMA  can  make  a  user  appreciate 
what  record  company  scouts  do. 

Garageband:  Hopefuls  face  Darwinian  competition  at  www.garageband.com,  where 
musicians  are  encouraged  to  rate  30  songs  before  submitting  one  of  their  own  (or  pay  a 
$19.99  fee  instead)  and  other  listeners  are  also  assigned  tracks  to  rate.  The  songs  that 
rise  to  the  top  of  the  charts  have  a  chance  to  be  heard  on  Garageband's  radio  outlets  or 
collected  on  its  compilation  albums. 

CNet:  The  computer  experts  at  CNet  include  an  extensive  selection  of  music  among  their 
software  downloads  at  music.download.com.  A  vast  bulk  of  the  music  is  submitted  by 
musicians  themselves. 

Vitaminic:  Founded  in  Italy,  Vitaminic  operates  in  nine  European  countries  through 
local-language  Web  sites  (www.vitaminic.co.uk,  .de,  .fr,  etc.).  It  offers  tens  of  thousands 
of  aspiring  bands  and  a  smattering  of  better-known  acts,  although  brand-name  bands  like 
Franz  Ferdinand  tend  to  offer  only  streaming  audio  rather  than  downloads.  But  the  site  is 
well  organized  and  also  includes  video  clips  from  the  likes  of  Nick  Cave. 

BeSonic:  A  site  founded  in  Germany  where  musicians  can  place  their  songs  online, 
www.besonic.com  has  a  slightly  more  international  perspective  than  the  other  newcomer 
sites.  Rankings  and  recommendations  help  visitors  sift  the  material.  Registration  is 
required  for  downloading. 

Pure  Volume:  More  than  76,000  songs  are  available  at  yet  another  site  for  aspiring 
musicians,  www.purevolume.com,  which  is  strongly  weighted  toward  rock.  To  winnow  the 
site,  try  the  Pure  Picks  column  or  look  under  the  category  Music  for  Top  Artists  (Signed). 

DMusic:  Musicians  can  also  post  their  own  songs  on  DMusic  (www.dmusic.com).  It  helps 
users  wade  through  more  than  17,000  acts  -  an  overwhelming  majority  categorized  as 
alternative  or  rock  -  by  listing  DM  Picks  and  by  having  users  give  songs  a  thumbs-up  or 
thumbs-down  and  append  comments. 

Smart-Music:  Dance-music  experimenters  dominate  at  www.smart-music.net,  a  selective 
site  that  draws  its  downloadable  MP3's  from  hard-to-find  small  labels. 

Ragga-Jungle:  Slow,  deep  reggae  bass  lines  are  the  foundation  for  whole  families  of  dance 
music  represented  at  www.ragga-jungle.com. 

Classic  Cat:  With  so  much  classical  music  in  the  public  domain,  it's  a  surprise  that  there 
aren't  more  free  downloadable  sites  offering  it,  although  the  length  of  classical 
compositions  can  make  them  inconvenient  to  download.  At  www.classiccat.net,  it's 
possible  to  search  by  composer,  from  Monteverdi  to  Messiaen.  The  selection  is  spotty  and 
links  don't  always  work,  but  it's  a  start. 
Amazon  to  Take  Searches  on  Web  to  a  New  Depth 

By  JOHN  MARKOFF 

PALO  ALTO,  Calif.,  Sept.  14  -  Amazon.com,  the  e-commerce  giant,  plans  to  take  aim  at  the  Internet 
search  king  Google  with  an  advanced  technology  that  the  company  says  will  take  searches  beyond 
mere  retrieval  of  Web  pages  to  let  users  more  fully  manage  the  information  they  find. 

A9.com,  a  start-up  owned  by  Amazon,  said  in  a  briefing  here  on  Tuesday  that  it  planned  to  make  the  new 
version  of  its  search  service,  named  A9.com,  available  Tuesday  evening.  The  service  will  offer  users  the 
ability  to  store  and  edit  bookmarks  on  an  A9.com  central  server  computer,  keep  track  of  each  link  clicked 
on  previous  visits  to  a  Web  page,  and  even  make  personal  "diary"  notes  on  those  pages  for  viewing  on 
subsequent  visits. 

"In  a  sense,  this  is  a  search  engine  with  memory,"  said  Udi  Manber,  a  computer  scientist  who  was  a 
pioneer  in  online  information  retrieval  and  worked  at  Yahoo  before  moving  to  Amazon  two  years  ago. 

Mr.  Manber  created  the  original  A9  search  service,  which  is  based  in  part  on  search  results  from  Google. 
He  also  led  the  development  of  Amazon's  "search  inside  the  book"  project,  which  lets  visitors  to  the 
Amazon.com  and  A9.com  Web  sites  search  the  complete  contents  of  more  than  100,000  books  the 
company  has  digitally  scanned. 

Amazon's  entry  into  the  search  engine  wars  will  certainly  raise  the  stakes  in  an  already  heated  battle  for 
control  of  what  is  believed  to  be  the  high  ground  in  Internet  commerce  and  advertising. 

Google,  which  had  a  widely  watched  public  stock  offering  last  month,  is  still  the  dominant  provider  of 
search  results  with  approximately  250  million  daily  searches.  But  Yahoo  and  Microsoft  have  become 
direct  competitors,  and  a  number  of  start-up  companies  are  busy  developing  search  technologies. 

Google  executives  did  not  return  calls  asking  for  comment. 

Amazon  is  also  offering  a  dialog  box  that  will  enable  customers  on  the  Amazon.com  shopping  site  to  use 
A9  service  to  perform  Web  searches.  Company  executives  say  they  have  no  immediate  plans  to  compete 
head-on  with  Google  and  the  other  search  providers.  But  analysts  say  the  company  is  aware  that  search 
engines  are  often  the  starting  point  for  online  shopping  and  cannot  help  but  see  broader  business 
opportunities  for  expanding  more  fully  into  online  searching. 

"They've  downplayed  the  idea  that  they're  going  into  search,"  said  Danny  Sullivan,  editor  of  Search 
Engine  Watch,  an  industry  Web  site.  "They  say,  'we're  not  competing.'  But  at  the  same  time  you  have  to 
wonder  why  they're  doing  it,  and  it's  likely  they're  doing  it  because  they  see  some  potential  in  search." 

Amazon  quietly  established  A9  last  year  as  a  subsidiary  in  a  large  office  building  here.  The  start-up  has 
been  offering  a  search  demonstration  page,  which  has  so  far  been  limited  to  the  ability  to  record  a  history 
of  Web  searches. 

The  new  service  goes  much  further,  adding  the  ability  to  organize  and  retrieve  past  searches.  The  idea  is 
to  make  searching  more  useful  by  making  it  easier  to  remember  where  a  Web  browser  has  gone  before. 

"The  ability  to  search  through  your  own  history  of  personal  Web  searches  is  insanely  powerful,"  said 
John  Battelle,  a  writer  and  consultant  who  is  the  organizer  of  the  Web  2.0  conference  to  be  in  San 
Francisco  next  month.  "This  is  a  big  deal,"  Mr.  Battelle  said.  "But  the  question  is  will  people  get  the  habit 
of  using  it?" 

The  new  A9  search  page  permits  users  to  search  the  Web  and  simultaneously  retrieve  related  information 
from  Google's  search  results  and  its  image  search  service,  reference  material  from  the  GuruNet  service 
and  additional  information  from  the  Internet  Movie  Database. 

A9  executives  said  that  the  new  version  of  the  service  was  simply  a  first  release  and  that  the  company  had 
extensive  plans  for  adding  new  capabilities. 

"This  is  just  version  1.0,"  said  Mr.  Manber.  "There  is  a  lot  more  to  come." 

But  Mr.  Manber,  who  began  working  on  information  retrieval  in  the  early  1990's  as  a  faculty  member  at 
the  University  of  Arizona,  was  reluctant  to  discuss  whether  A9  would  become  a  direct  competitor  to 
Google. 

A9  is  currently  using  Google  search  results  and  displaying  the  syndicated  Google  Adwords 
advertisements.  The  two  companies  share  revenue  from  the  advertisements.  Amazon  also  has  its  own 
independent  technology  for  indexing  the  Web,  as  a  result  of  its  purchase  in  1999  of  Alexa,  a  search 
company  founded  by  the  information  retrieval  specialist  Brewster  Kahle.  The  new  version  of  A9  offers 
some  Web  traffic  information  derived  from  Alexa,  but  not  search  results. 

Initially,  A9  will  focus  on  managing  information  like  bookmarks  and  search  history,  Mr.  Manber  said. 
"It's  not  just  about  search,"  he  said.  "It's  about  managing  your  information." 

The  A9  service  will  include  a  Web  browser  tool  bar  that  has  several  innovative  features,  like  the  ability  to 
create  instant  lists  from  individual  Web  pages  and  then  use  the  lists  to  move  among  those  pages. 

Moreover,  it  will  offer  a  home  page  giving  users  the  ability  to  edit  and  move  Web  links  easily  for  later 
retrieval. 

The  A9  site  will  also  offer  a  "discovery"  feature  that  gives  Internet  browsers  suggestions  on  Web  sites 
that  they  may  find  interesting,  based  on  their  searches  -  a  feature  similar  to  the  product  recommendation 
features  offered  on  Amazon. 

Mr.  Manber  said  that  A9  had  no  current  plans  to  include  paid  ads  in  search  results  or  to  give  a 
preference  to  products  sold  on  Amazon.  But  he  also  said  that  he  could  not  comment  on  future  plans, 
except  to  say  that  A9  did  have  plans  for  new  search  technologies  that  would  generate  revenue. 

He  stressed  that  the  evolution  of  Internet  search  capabilities  was  still  in  its  earliest  stages.  "We're  in  the 
Wright  brothers  phase  of  search  technology,"  he  said. 

A9  executives  said  they  were  acutely  aware  of  potential  privacy  concerns  raised  by  the  personalized 
nature  of  the  service  and  said  they  were  doing  a  variety  of  things  to  address  the  issue. 

There  will  be  a  version  of  the  A9  service  that  will  offer  anonymous  searches,  for  example,  Mr.  Manber 
said.  Moreover,  it  will  be  possible  to  turn  off  the  history  feature,  remove  information  from  an  individual 
history  list  and  even  entirely  clear  the  history  results  that  are  stored  on  the  A9  server,  he  said. 

"The  new  thing  here  is  not  that  this  information  is  being  collected,"  he  said,  but  rather  that  A9  is  actually 
letting  Web  users  have  access  to  their  browsing  histories  for  their  own  purposes. 
CRITIC'S  NOTEBOOK 

No  Fears:  Laptop  D.J.'s  Have  a  Feast 

By  JON  PARELES 

DOWNLOADING  music  from  the  Internet  is  not  illegal.  Plenty  of  music  available  online  is  not  just  free 
but  also  easily  available,  legal  and  —  most  important  —  worth  hearing. 

That  fact  may  come  as  a  surprise  after  highly  publicized  lawsuits  by  the  Recording  Industry  Association  of 
America,  representing  major  labels,  against  fans  using  peer-to-peer  programs  like  Grokster  and  EDonkey  to 
collect  music  on  the  Web.  But  the  fine  print  of  those  lawsuits  makes  clear  that  fans  are  being  sued  not  for 
downloading  but  for  unauthorized  distribution:  leaving  music  in  a  shared  folder  for  other  peer-to-peer  users 
to  take.  As  copyright  holders,  the  labels  have  the  exclusive  legal  right  to  distribute  the  music  recorded  for 
them,  even  if  technology  now  makes  that  right  nearly  impossible  to  enforce. 

Recording  companies  have  tried  and  failed  to  shut  down  decentralized  file-sharing  networks  the  way  they 
closed  the  original  Napster.  (That  name  is  now  being  used  for  a  paid-download  service.) 

Courts  have  ruled  that  the  services  can  continue  because  they  are  also  used  to  exchange  material  that  does  not 
infringe  on  recording-company  copyrights.  At  the  same  time,  a  bill  before  Congress,  the  Inducing 
Infringement  of  Copyrights  Act  of  2004,  seeks  to  restrict  the  way  file-sharing  programs  are  constructed. 

While  the  recording  business  litigates  and  lobbies  over  music  being  given  away  online,  countless  musicians 
are  taking  advantage  of  the  Internet  to  get  their  music  heard.  They  are  betting  that  if  they  give  away  a  song  or 
two,  they  will  build  audiences,  promote  live  shows  and  sell  more  recordings. 

As  with  the  rest  of  the  free  content  on  the  Internet,  there's  no  guaranteed  quality  control.  Lucas  Gonze,  whose 
webjay.org  lets  music  fans  post  playlists  that  connect  to  free  music  and  video,  describes  free  Internet  music 
as  "a  flea  market  the  size  of  Valhalla." 

The  first  place  to  look  for  free  music  online  is  at  musicians'  own  sites.  Many  performers,  from  Bob  Dylan 
(www.bobdylan.com)  to  the  Yeah  Yeah  Yeahs  (www.yeahyeahyeahs.com),  post  hard-to-find  songs  for 
listening:  some  as  free  downloads,  some  as  streaming  audio  (which  can  be  recorded  with  a  free  program  like 
StepVoice  at  www.stepvoice.com).  A  next  place  to  look  is  the  labels,  particularly  independent  rock  and 
electronic  labels  like  Matador  (www.matadorrecords.com/music/mp3s.html),  Vagrant 
(www.vagrant.com/vagrant/audio/audio.jsp),  Barsuk  (www.barsuk.com),  Saddle  Creek  (www.saddle-creek.com)  or 
Tigerbeat6  (www.tigerbeat6.com/html/catalogue.htm). 

Many  public  radio  stations  also  maintain  music  archives  for  streaming  or  downloading.  Among  them  are  the 
classical-music  station  WNYC  (www.wnyc.org)  and  eclectic  stations  like  WFMU  in  Jersey  City 
(www.wfmu.org)  and  KCRW  in  Santa  Monica,  Calif.  (www.kcrw.org),  all  of  which  have  troves  of  live 
performances.  MTV  (at  www.mtv.com)  presents  an  entire  album  each  week  as  an  audio  stream. 


Following  is  a  selection  of  sites  offering  free  music  online.  Most  of  them  are  best  used  with  either  a 
broadband  connection  or  nearly  infinite  patience.  While  major-label  recordings  are  largely  (but  not  entirely) 
off  limits,  there's  more  than  enough  available  music  to  satisfy  every  listener. 

Epitonic 

The  first  and  best  place  to  look  for  any  band  with  an  independent  recording  is  www.epitonic.com,  a  superbly 
organized  site  that  is  likely  to  have  music  from  nearly  everyone  heard  on  college  radio.  It  includes  not  only 
downloadable  songs  but  also  biographical  information  and  links  for  hundreds  of  acts,  grouped  under  genres 
and  subgenres.  And  it  has  an  invaluable  "Similar  Artists"  feature  that  can  direct  fans  of  one  band  to  dozens  of 
potential  new  favorites.  Within  Epitonic's  huge  roster  is  at  least  a  song  or  two  from  some  major-label  acts, 
among  them  the  New  York  band  Secret  Machines,  the  Texas  band  Sparta  and  the  English  bands  Radiohead 
and  Spiritualized.  But  independent  bands  like  Bright  Eyes  or  Godspeed  You  Black  Emperor  are  every  bit  as 
good. 

Webjay 

At  www.webjay.org,  music  fans  share  their  Web  finds  with  the  world.  There's  no  music  on  the  site,  just  lists 
of  links  that  allow  users  either  to  play  entire  lists  or  to  download  items  directly  one  by  one;  it  also  includes 
links  to  videos  and  news  sound  bites.  Webjay  is  something  like  the  lists  submitted  by  customers  at 
www.amazon.com,  but  with  connections  to  the  music  itself.  As  such,  it's  only  as  good  as  the  widely  varied  skills  of 
its  contributors,  and  its  links  aren't  always  dependable.  But  it  is  a  way  for  musical  obsessives  like 
bigwavedave  to  share  his  fondness  for  garage-rock  or  for  OddioKatya  to  point  listeners  toward  a  wide 
assortment  of  Brazilian  songs. 

Furthurnet 

Before  the  Internet  became  ubiquitous,  the  Grateful  Dead's  fans  built  up  their  own  network  to  exchange 
concert  recordings,  a  network  that  expanded  as  other  jam  bands  sprang  up.  The  logical  extension  of  the 
process  is  Furthurnet  (www.furthurnet.com).  It  is  a  peer-to-peer  network  that  trades  only  recordings  of  bands 
that  encourage  listeners  to  record  concerts:  not  just  the  Dead  but  Phish,  Gov't  Mule,  Dave  Matthews  Band, 
Los  Lobos,  Wilco  and  David  Byrne  as  well.  Users  need  to  install  a  program  available  on  the  Web  site.  Most 
of  the  available  concert  recordings  don't  use  MP3  files,  but  a  better  quality  audio  format,  SHN,  which  also 
requires  some  software  installation.  It's  easy;  information  on  the  site  explains  all  the  technicalities. 

Another  connection  for  jam  bands  is  www.etree.org,  which  points  listeners  toward  recordings  stored  online 
and  is  equally  fastidious  about  high  fidelity.  Meanwhile,  concert  recordings  of  all  sorts,  from  vintage  1960's 
bootlegs  to  music  only  a  few  days  old,  have  been  traded  at  www.sharingthegroove.org,  although  the  site  is 
currently  undergoing  maintenance. 

The  Library  of  Congress 

Through  the  years,  tax  dollars  have  supported  researchers  like  Alan  Lomax  on  excursions  to  collect  music 
from  every  nook  and  cranny  and  tradition  they  could  discover  across  the  United  States.  The  Library  of 
Congress  has  made  a  considerable  amount  available  free  online.  A  place  to  start  is  the  American  Memory 
Collection  (http://memory.loc.gov/ammem/audio.html),  with  fiddle  tunes,  American  Indian  music,  border 
music  from  the  Rio  Grande,  Dust  Bowl  songs  and  more. 

Folkways  Records 

In  1987,  the  Smithsonian  Institution  bought  the  catalog  of  Folkways  Records,  which  had  set  out  to  document 
every  sound  in  the  world  and  continues  to  support  projects  like  a  20-disc  collection  of  Indonesian  music. 


Many  of  the  Folkways  recordings  can  be  heard  on  the  Web  at  www.folkways.si.edu,  from  "Classical  Music 
of  Iran"  to  "Creole  Music  of  Suriname"  to  "Music  of  Indonesia  Vol.  1:  Songs  Before  Dawn." 

Internet  Archive 

The  Internet  Archive  (www.archive.org)  has  set  out  to  preserve  material  that  might  otherwise  disappear 
from  the  Internet,  including  Web  pages,  documents,  books  and  video  clips  as  well  as  audio,  and  it  includes  a 
Live  Music  Archive  with  more  than  10,000  concerts  via  etree.org.  Most  are  from  jam  bands,  but  there  is 
plenty  to  choose  from.  (More  than  a  million  people  have  downloaded  Grateful  Dead  music  from  the  archive.) 
The  archive  also  includes  an  assortment  of  other  audio  under  All  Collections,  which  has  131  songs  from 
78-r.p.m.  discs,  and  more  than  3,000  songs  on  what  it  calls  netlabels,  most  of  them  releasing  electronic  music. 
Try  the  exotica-tinged  selections  from  Monotonik. 

IUMA 

The  Internet  Underground  Music  Archive  (www.iuma.org)  was  a  pioneer  of  free  Internet  music.  It  was 
founded  in  1993  as  a  place  for  musicians  to  post  their  own  music  online,  and  it  just  keeps  on  expanding. 
Unfortunately,  it  is  both  overwhelming  and  overwhelmed;  finding  a  good  song  requires  extraordinary  luck, 
and  downloading  it  will  take  a  while.  Like  the  other  send-it-yourself  sites  noted  here,  IUMA  can  make  a  user 
appreciate  what  record  company  scouts  do. 

Garageband 

Hopefuls  face  Darwinian  competition  at  www.garageband.com,  where  musicians  are  encouraged  to  rate  30 
songs  before  submitting  one  of  their  own  (or  pay  a  $19.99  fee  instead)  and  other  listeners  are  also  assigned 
tracks  to  rate.  The  songs  that  rise  to  the  top  of  the  charts  have  a  chance  to  be  heard  on  Garageband's  radio 
outlets  or  collected  on  its  compilation  albums.  Garageband  demands  original  songs,  not  cover  versions,  and 
its  top-rated  ones  tend  to  sound  more  professional,  if  not  always  more  distinctive,  than  those  at  other  mass 
upload  sites. 

CNet 

The  computer  experts  at  CNet  include  an  extensive  selection  of  music  among  their  software  downloads  at 
http://music.download.com.  A  vast  bulk  of  the  music  is  submitted  by  musicians  themselves,  so  there  are  a  lot 
of  derivative  sounds  to  wade  through,  but  the  well-organized  site  also  includes  worthwhile  bands  as  Editor's 
Picks,  currently  including  Dios  and  Ex  Models. 

Vitaminic 

A  huge  site  based  in  England,  www.vitaminic.co.uk,  offers  tens  of  thousands  of  aspiring  bands  and  a 
smattering  of  better-known  acts,  although  brand-name  bands  like  Franz  Ferdinand  tend  to  offer  only 
streaming  audio  rather  than  downloads.  But  the  site  is  well  organized  and  also  includes  video  clips  from  the 
likes  of  Nick  Cave. 

BeSonic 

A  European  site  where  musicians  can  place  their  songs  online, 
www.besonic.com  has  a  slightly  more  international  perspective  than  the  other  newcomer  sites.  Rankings  and 
recommendations  help  visitors  sift  the  material.  Registration  is  required  for  downloading. 

Pure  Volume 


More  than  76,000  songs  are  available  at  yet  another  site  for  aspiring  musicians,  www.purevolume.com,  which 
is  strongly  weighted  toward  rock.  To  winnow  the  site,  try  the  Pure  Picks  column  or  look  under  the  category 
Music  for  Top  Artists  (Signed). 

DMusic 

Musicians  can  also  post  their  own  songs  on  DMusic  (www.dmusic.com).  It  helps  users  wade  through  more 
than  17,000  acts  —  an  overwhelming  majority  categorized  as  alternative  or  rock  —  by  listing  DM  Picks  and 
by  having  users  give  songs  a  thumbs-up  or  thumbs-down  and  append  comments.  As  with  IUMA,  most  are 
amateur  submissions,  with  plenty  of  jokes,  but  there  are  some  enjoyable  tracks  scattered  among  the  picks. 

Smart-Music 

Dance-music  experimenters  dominate  at  www.smart-music.net,  a  selective  site  that  draws  its  downloadable 
MP3's  from  hard-to-find  small  labels.  Dipping  into  the  genres  and  subgenres  of  electronica,  Smart-Music  has 
about  300  songs  available  from  (relatively)  well-known  groups  like  Mouse  on  Mars  and  Zero  7  as  well  as 
basement  laptop  obsessives,  and  a  high  percentage  of  them  turn  out  to  be  worthwhile. 

Ragga-Jungle 

Slow,  deep  reggae  bass  lines  are  the  foundation  for  whole  families  of  dance  music  represented  at 
www.ragga-jungle.com.  It's  an  outlet  for  amateur  and  professional  producers  and  toasters  (rappers),  and  the 
downloadable  songs,  available  free  after  registration,  include  echoey  dub-reggae  vamps,  sparse  dance-hall 
productions  and  frenetic  jungle  tracks.  Each  track  has  ratings  and  comments,  and  quick  streaming  allows 
users  to  sample  tracks  before  committing  to  a  download.  Contender  for  best  title:  "A  Waste  of  Half  an  Hour 
of  My  Life,  and  Four  Minutes  of  Yours"  by  the  Archangel. 

Classic  Cat 

With  so  much  classical  music  in  the  public  domain,  it's  a  surprise  that  there  aren't  more  free  downloadable 
sites  offering  it,  although  the  length  of  classical  compositions  can  make  them  inconvenient  to  download.  At 
www.classiccat.net,  it's  possible  to  search  by  composer,  from  Monteverdi  to  Messiaen.  The  selection  is  spotty 
and  links  don't  always  work,  but  it's  a  start. 

Asian  Classical 

Need  some  Indonesian  gamelan  music?  On  the  Internet  at  www.asianclassicalmp3.org,  a  dedicated  collector 
of  Asian  music  has  transferred  recordings  from  cassettes  to  downloadable  MP3's.  The  site  includes  music 
from  nine  countries,  including  28  minutes  of  gamelan  music  from  Java. 

Iraqi  Music 

The  straightforwardly  named  www.iraqimusic.com  is  a  resource  for  both  the  classical  Iraqi  improvisations 
called  maqams  and  more  recent  Iraqi  recordings  based  on  traditional  (and  thus  noncopyrighted)  songs.  "Sister 
Sites"  provides  links  to  other  sites  with  Middle  Eastern  music. 

Trama 

A  Brazilian  record  label,  Trama  (www.tramavirtual.com),  offers  about  10,000  MP3's,  primarily  from  local 
Brazilian  bands.  The  site  is  in  Portuguese  and  requires  users  to  sign  up,  but  after  that,  it  is  fairly  easy  to 
navigate.  "Baixar"  means  download. 


Micromusic 

The  Internet  is  home  to  countless  obsessives.  The  ones  gathered  at  www.micromusic.net  make  their
electronic  music  from  the  sounds  of  the  first  primitive  video  games.  Proud  of  what  they  can  generate  from 
eight-bit  gizmos,  they  have  placed  hundreds  of  blipping,  buzzing  ditties  online,  garnering  the  attention  of 
Malcolm  McLaren,  the  Sex  Pistols'  manager,  among  others.  Registration  is  required,  but  it's  a  modest 
inconvenience  on  the  way  to  tunes  like  "How  Bleep  Is  My  Love." 
Preserving  the  Past

By  LISA  NAPOLI 

Scour  the  Web  for  resources  and  historical  information  about  Sept.  11,  2001,  and  you  are  likely  to  come
up  short:  the  rapid  pace  of  online  publishing  combined  with  the  Web's  ephemeral  nature  leaves  few
traces.  Many  commemorative  sites  constructed  by  well-intentioned  Webmasters  have  been  long  abandoned. 

Brewster  Kahle  is  not  surprised.  "The  average  life  of  a  Web  page  is  100  days,"  said  Mr.  Kahle,  a  co-founder 
of  the  Internet  Archive,  a  private  effort  that  aims  to  become  the  Library  of  Congress  of  the  Internet. 
"Astonishing.  It's  statistically  easy  to  say  that  the  best  of  the  Web  is  already  gone." 

A  snapshot  of  how  the  day  unfolded  online  is  available  at  Mr.  Kahle's  site,
web.archive.org/collections/sep11.html.  It  includes  pages  from  30,000  sites  from  Sept.  11  through  November
2001.  A  quick  browse  feels  a  bit  like  thumbing  through  yellowing  newspapers,  without  the  paper  crackle.

The  Defense  Department  home  page  from  two  days  after  the  attack  is  matter-of-fact,  displaying  pictures  of 
the  Pentagon.  Google  grayed  out  its  logo  and  offered  a  somber  text  headline  offering  condolences  to  victims, 
and  a  link  to  Web  resources.  The  US  Airways  home  page  reported,  without  context:  "US  Airways  Operation 
Rebounds."  There  are  hundreds  more  home  pages  frozen  in  time. 

An  even  better  sampling  is  at  www.interactivepublishing.net/september,  with  the  front  pages  of  more  than
250  news  sites  from  Sept.  11  and  12,  2001.

"Archives  are  the  raw  materials  -  no  point  of  view,"  Mr.  Kahle  said.  "You  want  to  see  that  it  has 
comprehensiveness.  Then  you  look  at  it  and  go,  'Oh,  my  God.'  " 

Hair-Raising  Tales

Nathan  Hemenway  is  someone  who  finds  it  relaxing  to  mess  around  on  the  computer  -  even,  in  his  case,  after 
spending  hours  each  day  in  Los  Angeles  as  a  Web  designer  and  programmer. 

Not  everyone  who  spends  time  this  way  comes  up  with  a  Web  site  filled  with  bright,  amusing  animation  that 
is  good  enough  for  television,  but  without  the  commercials.  Then  again,  not  every  would-be  animator  has 
worked  for  NBC,  the  Sundance  Film  Festival  and  an  interactive  gaming  company  called  Six  Red  Marbles. 

On  Mr.  Hemenway's  site,  www.kksbolash.org,  you  can  see  short  animations  like  the  tale  of  "The  Great 
Sardine,"  and  the  odd  story  of  a  man  named  Roy  G.  Biv.  But  the  real  star  of  the  show  is  Kristina  Krumb.  In 
one  episode,  this  plucky  little  bug-collecting  girl  with  long  dark  hair  encounters  a  Scary  Hair  Monster. 

The  episode  was  inspired  by  "the  strange  things  people  do  to  prevent  hair  from  getting  caught  in  the  drain," 
Mr.  Hemenway  said.  "It  became  a  fancy  that  the  hair  could  come  to  life.  And  the  idea  that  it  could  become  a 
secret  friend  was  also  quite  silly  to  me." 


Silly,  and  playful.  Mr.  Hemenway,  37  and  recently  married,  once  programmed  video  games.  He  sees  his  site 
as  a  place  for  experimentation  that  is  not  typical  at  most  companies. 

His  intended  audience  is  "film  festivals,  animation  enthusiasts,  artists,  writers  and  the  curious."  Chances  are 
good  that  what  happens  to  that  hair  in  Kristina's  drain  is  not  what  you'll  experience  at  home. 

Question  of  Fitness 

Those  who  practice  yoga  know  that  its  purpose  is  to  empty  the  mind,  or  at  least  to  focus  it. 

One  day  while  Daniel  Cota  was  practicing  yoga  in  Berkeley,  Calif.,  where  he  lives,  he  had  what  he  thought 
was  an  odd  sort  of  brainstorm:  "the  strange  juxtaposition  between  George  W.  Bush  and  yoga." 

Several  months  before,  on  an  impulse,  he  had  purchased  a  George  W.  Bush  doll.  When  he  got  home  from  a 
yoga  class  one  day,  he  discovered  that  the  doll  was  very  flexible.  He  photographed  it  in  various  yoga  poses, 
and  a  Web  site  was  born:  www.bushyoga.com. 

Mr.  Cota's  yoga  instructor  wrote  some  of  the  text  that  accompanies  each  pose.  It  is  clear  from  the  site  that 
neither  man  is  a  member  of  the  G.O.P.  The  entry  for  Downward  Facing  Dog,  for  example,  reads:  "To  start, 
get  on  your  hands  and  knees  with  your  toes  curled  under.  Inhale.  On  exhale  push  the  ground  away  from  you 
as  you  lift  your  hips  up.  Look  for  W.M.D.'s.  To  release,  lower  onto  your  hands  and  knees  with  an  exhale." 

Mr.  Cota,  who  speaks  with  a  voice  as  soothing  as  a  meditation  instructor's,  said  that  on  Election  Day,  he  plans
to  do  some  yoga,  have  a  beer  and  watch  the  returns  on  television.  "It's  going  to  be  an  interesting  time, 
whatever  the  results,"  he  said.  Still,  despite  his  own  practice,  he  said,  "I  don't  think  people  are  too  different  in 
the  world  if  they  do  yoga." 

A  few  calisthenics  may  not  make  much  difference  to  the  commonweal,  either.  But  at 
www.miniclip.com/kerryworkout.htm,  you  can  put  an  animated  John  Kerry  through  his  aerobic  paces. 

E-mail:  online@nytimes.com 
Bookmobiles,  the  village  kitaabwala

Rajiv  Theodore  |  August  16,  2004  |  13:44  IST

The  afternoon  call  of  the  muezzin  breaks  the  silence  in  Dadri  village  on  the  outskirts  of  Delhi.  The  villagers  stir  out  in  the 
heat  as  a  red  Mahindra  Scorpio  approaches,  throwing  up  a  cloud  of  dust  behind  it. 

"It's  the  kitaabwala,"  says  one  villager.  These  semi-rural  folk  are  watching  the  unfolding  of  a  revolution  on  wheels  that  is 
slowly  reaching  out  to  rural  India. 

Helped  by  the  crowd,  70-year-old  Roopwati  hobbles  toward  the  van  and  demands  Mohandas  Karamchand  Gandhi's  My 
Experiments  with  Truth.  The  van  doesn't  keep  a  copy  but  there's  an  easy  way  to  remedy  that.  A  command  is  given  on  a 
laptop,  the  signals  are  relayed  and  received  by  a  dish  antenna  with  KU  band. 

Then  it's  printed  and  bound,  all  in  a  few  minutes.  For  a  little  less  than  Rs  20,  the  village  woman  gets  the  book  she  wanted,
saving  an  arduous  journey,  possibly  to  a  library  or  bookshop  in  nearby  Delhi.  Welcome  to  the  world  of  Digital  Bookmobiles.

In  the  United  States,  the  'Bookmobiles'  started  by  Brewster  Kahle,  a  digital  librarian,  are  rapidly  gaining  popularity.  A
book  like  Alice  in  Wonderland,  for  instance,  is  available  for  a  dollar  and  a  copy  can  be  printed  in  10  minutes.

"Books  are  the  key  to  knowledge  but  they  are  no  use  if  we  hold  on  to  it.  Therefore,  the  moral  of  the  story  is  digitise  and 
replicate,"  says  Dr  Om  Vikas,  who  heads  the  Digital  Library  of  India  Initiative  under  the  Department  of  Information
Technology.  India  has  a  multiplicity  of  languages,  scripts,  manuscripts  and  fonts.  "This  forms  a  vast  treasure  of  heritage,"
says  Vikas. 

Obviously,  books  are  only  the  tip  of  the  iceberg  in  terms  of  what  is  possible.  It  could  be  research  tools,  photographs,  music, 
market  information,  trading,  remote  customer  interaction,  e-tutoring,  e-publishing  and  even  book  fairs  -  the  list  is  endless. 
"Universal  access  to  all  human  knowledge"  is  Kahle's  ambitious  goal. 

How  it  works  is  that  a  book  or  manuscript  is  first  scanned  by  a  high-end  Minolta  BS  7000  scanner  (one  hundred  of  them
were  recently  donated  by  Carnegie  Mellon  University),  followed  by  a  "cropper"  treatment  whereby  all  unwanted  stains  or
needless  images  on  the  original  text  are  deleted.  Before  being  put  on  the  web,  the  manuscript  passes  through  indigenously
developed  software  called  the  Optical  Corrector  Recognizer,  available  currently  in  seven  Indian  languages.

It  is  a  concept  many  developing  countries  like  China  and  Egypt  have  also  taken  up  enthusiastically.  But  India  has  taken  a
huge  lead  already.  Thirty  Bookmobiles  will  soon  be  on  the  road.  Two  are  already  driving  around  Delhi's  fringe  villages
adjoining  Uttar  Pradesh.  Phase  two  of  the  project,  which  is  scheduled  to  take  off  soon,  will  cover  Uttar  Pradesh,  Delhi, 
Punjab,  Haryana,  Rajasthan  and  Madhya  Pradesh. 

"By  2008  we  will  have  covered  the  country,"  says  V  N  Shukia,  director,  Special  Applications,  at  the  Pune-based  Centre  for
Development  of  Advanced  Computing,  which  is  executing  this  government-sponsored  project  under  the  aegis  of  the
Ministry  of  Communications  and  Information  Technology.  The  project  first  went  on  the  road  in  January  2003. 

Shukia,  who  sometimes  accompanies  the  Scorpio  or  the  other  digital  library,  a  Maruti  Versa,  says  it  is  a  daunting  task  to 
manage  a  milling  crowd  of  more  than  200  people  at  any  given  time,  who  jostle  for  attention  when  the  vans  arrive  at  a  village.  "It
has  become  immensely  popular,"  he  says. 

The  IT  Ministry  had  given  Rs  1  crore  (Rs  10  million)  for  the  first  phase  and  another  Rs  5  crore  (Rs  50  million)  will  be 
allocated  for  the  second  phase.  An  investment  of  a  couple  of  lakhs  is  all  the  van  needs  -  a  printer,  a  cutter,  a  binder  and  a
satellite  dish  for  downloading,  says  Shukia.  It  takes  only  about  10  minutes,  from  start  to  finish,  to  create  a  perfect-bound
book. 

"It  could  be  the  most  expensive  book  too,  but  when  downloaded  would  cost  a  fraction  of  its  print  version,"  he  said.  More
than  60,000  books  have  already  been  scanned  and  another  100,000  have  been  sent  to  India  by  Kahle. 

Manufacturers  Association  of  Information  Technology  President  Vinnie  Mehta  said  the  new  concept  is  extremely  beneficial 
for  the  rural  masses  where  even  education  courses  can  be  downloaded  and  distributed  where  no  schools,  colleges  or
teachers  exist.  "It  is  time  we  moved  away  from  the  urban  areas,"  says  Mehta. 

The  Million  Book  Project,  a  joint  venture  between  India,  China,  the  Carnegie  Mellon  University  and  Kahle's  Internet  Archive,
is  an  offshoot  of  this  new  technology.  The  project  is  further  linked  to  Kahle's  e-books  -  17  million  of  them.  The  christening
of  the  project  took  its  cue  from  Kahle's  unassuming  legend  painted  on  his  vans,  "1,000,000  Books  Inside  (soon)".  The
project  is  set  to  digitise  one  million  public  domain  books  and  make  them  available  in  scanned  format  for  anybody  for  free  by 
next  year. 

After  graduating  from  the  Massachusetts  Institute  of  Technology  in  1982,  Kahle  designed  supercomputers  at  Thinking 
Machines  and  later  invented  the  Wide  Area  Information  Servers,  which  was  the  net's  first  publishing  system. 

During  the  1990s,  WAIS  got  commercial  and  government  publishers  -  among  them  the  New  York  Times,  Encyclopedia 
Britannica  and  the  US  government's  printing  Internet  archive.  After  selling  WAIS  to  AOL,  Kahle  founded  the  Internet 
Archive,  a  non-profit  company  devoted  to  archiving  and  cataloging  millions  of  websites. 

The  Bookmobile  is  one  of  the  latest  offerings  from  Internet  Archive.  Kahle  said  that  100,000  books  have  been  sent  to  India
where  they  are  now  being  scanned.  "They  (Indian  government)  see  that  for  the  cost  of  scanning  a  book  they  can  make  it 
available  to  the  entire  country.  So  they  are  scanning  up  a  storm,  with  a  goal  of  1  million  books.  In  China,  the  same  thing  is 
happening.  They  are  going  to  scan  100,000  books.  The  dream  is  of  a  library  where  you  can  have  access  to  all  the  world's 
knowledge,"  Kahle  said  in  a  conversation  with  a  website.

The  giant  question  is  how  much  this  can  be  expanded.  Information  Technology  Secretary  Kamal  Kant  Jaswal  will  soon
review  the  entire  project,  looking  particularly  at  scalability.  "That  is  a  weak  link,"  says  Jaswal.  He  believes  that  the  project
needs  backing  from  a  private  entrepreneur  to  ensure  it  grows  rapidly.  "We  have  to  find  someone  who  can  champion  the 
cause  of  the  project,"  he  says. 

A  senior  librarian  at  New  Delhi's  new  Parliament  library  says  costs  can  easily  be  kept  under  control.  Scanning  a  book  in 
India  costs  $4  compared  to  between  $20  and  $25  in  the  US.  Part  of  the  funding  for  public  library  systems  in  India  could  be 
canalised  to  create  a  brand  new  e-library  and  make  it  available  to  the  remotest  corner  of  the  country  for  just  Rs  20. 

Obviously,  there's  an  issue  of  copyright.  But  some  librarians  suggest  that  the  project  should  stick  to  non-copyright  books  for 
the  time  being.  "Why  should  we  bother  about  it  when  we  have  not  even  digitised  all  the  public-domain  books  yet?"  he
says. 

Vikas  says  that  15,000  books  in  local  languages  have  already  been  scanned.  "We  have  urged  in  different  forums  to  reduce
copyright  from  the  present  60  years  to  25  years  so  that  more  books  are  available  to  the  people  and  finally  help  bridge  the 
digital-divide  that  is  plaguing  the  country." 
Elvis  has  left  the  building  -  time  to  free  his  works  too


The  international  recording  industry  is  preparing 
to  lobby  the  EU  for  changes  to  existing  copyright 
law.  But  in  an  attempt  to  manipulate  law  for  profit, 
we  are  being  asked  to  place  whole  chunks  of  our 
culture  into  a  commercial  vacuum.  Becky  Hogge 
comments. 

Twenty-seven  years  after  his  death,  Elvis  Presley  is  still  climbing 
the  charts.  The  track  that  is  credited  with  giving  birth  to  rock  and 
roll,  That's  All  Right,  reached  number  three  last  month,  fifty  years
after  it  was  first  recorded  in  Memphis'  Sun  Studios.

For  most  record  buyers  the  track  was  just  a  nostalgic  trip.  But  to 
record  industry  officials  it  is  a  "call  to  arms".  For  on  1  January 
2005,  this  seminal  recording  will  drop  out  of  copyright  and  into  the 
public  domain. 

Under  current  EU  law,  sound  recordings  are  classified  as
"performance"  and  copyrighted  for  a  period  of  50  years.  This  is  not 
to  be  confused  with  compositions,  which  remain  in  copyright  for 
the  artist's  lifetime  plus  70  years,  preventing  others  from  covering 
or  sampling  the  track  without  paying  some  royalties. 

Nevertheless  what  this  law  does  mean  is  that,  from  January, 
anyone  may  store,  share,  swap  or  commercially  release  That's  All 
Right  without  recourse  to  RCA,  who  currently  own  rights  to  the 
track  as  part  of  their  back  catalogue.  Further,  over  the  next 
decade  and  beyond,  other  such  seminal  recordings  -  from  Chuck 
Berry  to  Johnny  Cash  and,  eventually,  The  Beatles  -  will  come  into
the  public  domain. 

Faced  for  the  first  time  with  losing  significant  back  catalogue 
profits,  the  industry  is  lobbying  to  change  the  law.  The  industry 
describes  the  law  as  a  "loophole".  In  fact  it  is  anything  but. 

For  every  one  recording  that  has  the  power  to  reach  number  three 
in  the  commercial  charts  fifty  years  after  its  original  release,  there
are  hundreds  if  not  thousands  of  tracks  that  do  not. 

Although  these  recordings  no  longer  have  any  commercial  value 
to  their  rights  holders,  they  are  of  tremendous  value  in  terms  of 
our  cultural  heritage.  But  the  mechanisms  of  copyright  law  mean 
that,  should  the  European  Parliament  choose  to  heed  the  music 
industry,  keeping  Elvis  out  of  the  public  domain  for  a  further  45 
years  or  even  more,  the  King  will  drag  down  with  him  this  huge 
body  of  commercially  worthless  but  culturally  significant  work. 

Works  of  no  commercial  value  will  be  orphaned,  languishing  in 
forgotten  store  cupboards  at  record  company  headquarters  when 
they  could  be  enjoying  a  digital  rebirth  in  the  public  domain. 

Brewster  Kahle  is  head  of  the  Internet  Archive,  an  American 
organisation  which  aims  to  store  a  digital  record  of  every  web 
page,  TV  broadcast,  radio  program,  book,  song,  film  -  every 
product  of  human  culture  they  can  lay  their  hands  on  -  in  a 
high-profile  project  likened  to  the  building  of  the  ancient  library  at 
Alexandria. 

His  team  are  currently  digitising  500,000  old  78s  to  add  to  the 
library.  In  a  recent  appearance  in  the  UK  at  the  technology 
conference  NotCon,  Kahle  urged  attendees  to  safeguard  the  EU 
performance  copyright  law. 

"A  couple  of  these  tracks  may  be  of  commercial  value,"  says 
Kahle,  but  in  leaving  sound  recordings  under  copyright,  "we  are 
taking  away  all  of  early  20th  century  culture  from  our  use.  We  will 
subject  our  children  and  grandchildren  to  only  listening  to  the 
current  pop  hits  and  not  being  able  to  learn  from  the  past." 

The  advantages  of  digital  archives  are  clear.  Not  only  is  it  possible 
to  store  far  greater  volumes  of  information  digitally  than  in  "hard 
copy",  but  unlike  the  ancient  library  at  Alexandria,  digital  libraries 
are  less  likely  to  burn  to  the  ground. 
If  the  freedom  of  a  society  is  defined  by  its  access  to  information,
then  libraries  play  a  central  role  in  democracy.  When  pumped-up 
copyright  laws  start  undermining  the  ability  of  these  new  digital 
libraries  to  function,  then,  as  we  move  further  into  the  digital  age, 
question  marks  may  start  to  appear  over  the  shape  of  free 
speech. 

Indeed  Kahle  is  presently  suing  US  Attorney  General  John 
Ashcroft  under  the  claim  that  modern  copyright  law  runs  in  breach 
of  the  US  First  Amendment  right  to  free  speech. 

He  also  argues  that  extending  copyright  terms  across  the  board, 
without  demanding  whether  the  original  authors  actually  value  the 
copyright  anymore  through  some  kind  of  registration  system, 
effectively  traps  the  majority  of  works  -  those  not  commercially 
viable  for  re-release  yet  not  allowed  into  the  public  domain. 

The  public  domain  isn't  just  a  repository  for  our  cultural  heritage  - 
its  richness  provides  inspirational  material  which  in  turn  leads 
artists  to  create  new  works.  Where  would  Disney  be  without  the 
success  of  translating  age-old  fairy  tales  like  Cinderella  and  Snow 
White  into  the  captivating  new  format  of  animation?

Yet  the  US  1998  Sonny  Bono  Copyright  Term  Extension  Act 
(CTEA)  was  arguably  a  direct  result  of  the  Disney  Corporation's 
lobbying  of  Congress  to  protect  Mickey  Mouse  from  falling  into  the 
public  domain  on  his  75th  birthday  in  2003. 

"By  repetitively  extending  copyright,  the  question  that  comes  up  is: 
will  this  ever  stop?"  asks  Kahle.  "The  authors  of  the  last  copyright 
extension  in  the  United  States  ...  publicly  stated  that  they  were 
interested  in  copyright  lasting  'one  day  short  of  eternity'.  This 
statement  seems  consistent  with  the  strategy  that  is  unfolding 
among  American  lobbies." 

Copyright  law  was  originally  framed  with  this  in  mind;  that 
individual  artists  should  be  able  to  create  works  and  have  those 
works  as  their  exclusive  property  was  the  incentive  for  artists  to 
keep  on  creating.  Yet  even  Elvis  himself  owes  a  great  debt  to 
early  black  rhythm  and  blues  musicians  who  contributed  to  his 
unique  style. 

Since  copyright  law  was  originally  framed,  industries  like  the 
record  industry  which  publish  creative  works  have  grown  and 
merged  to  become  global  mega-corps.  Yet  despite  their  size,  the 
internet,  with  its  unbridled  ability  to  disseminate  a  wide  range  of 
cultural  products  at  zero  cost,  presents  a  huge  threat  to  these 
industries. 

Copyright  law  is  their  only  weapon  against  the  new  technology  of 
the  internet,  but  they  have  the  muscle  to  use  it  to  great  effect. 
When  the  printing  press  first  enabled  the  distribution  of  bibles  to 
the  masses,  the  incumbent  authority  at  the  time  -  the  religious 
establishment  -  was  also  "called  to  arms". 

If  we  allow  incumbent  publishing  industries  to  stunt  the  growth  of 
the  internet  and  secure  their  future  profits  by  manipulating 
copyright  law  and  waging  assaults  on  the  public  domain,  who 
knows  what  cultural  renaissance  we  may  be  missing  out  on? 

The  shift  away  from  the  public  domain  towards  copyright 
perpetuity  should  be  the  public's  "call  to  arms".  Says  Kahle: 
"Everything  we  do  is  built  on  the  past.  If  we  effectively  take  the 
past  away  from  us,  then  we're  living  in  a  perpetual  present.  This  is 
only  a  world  George  Orwell  might  smirk  at." 

■  Becky  Hogge  is  a  freelance  writer  &  journalist. 

Links: 

■  The  Internet  Archive. 

■  The  International  Federation  of  the  Phonographic  Industry  (IFPI) 
makes  the  case  for  copyright  protection.

■  Billboard  magazine,  via  MSNBC  report:  "Extension  of  the  term 
of  duration  of  recording  rights  is  the  music  industry's  main  priority 
on  the  legislative  agenda  in  Europe." 

■  NotCon  'Minimalist  Website'. 
■  The  Electronic  Frontier  Foundation,  the  US-based  non-profit
group  which  works  to  protect  people's  digital  rights.

■  Another  take  on  the  issue  from  the  London  News  Review. 

■  The  saga  surrounding  Los  Angeles  DJ  Danger  Mouse's 
experimental  CD  blending  tracks  from  rapper  Jay-Z's  Black  Album 
with  the  Beatles'  White  Album  as  reported  by  Index  on 
Censorship. 
Sheila  Lennon:  Open  Media  project  hopes  for  your  videos

August  11,  2004 

By  SHEILA  LENNON  /  The  Providence  (R.I.) 
Journal 

7:05  p.m.  Tuesday  (Blogroll) 

Open  Media  project  hopes  for  your  videos 

J.  D.  Lasica  and  Marc  Canter  have  teamed  up  to  make  a 
grassroots  video  concept  a  reality.  Marc  describes  it: 

...  enable  folks  to  access  the  HUGE 
repositories  of  public  domain  and  Creative 
Commons  content  —  that's  out  there. 

And  to  help  build  our  own  huge  repository 
of  CC  content. 

First,  we'll  start  off  with  upload  sites  — 
which  will  enable  folks  to  start  getting  their 
stuff  into  the  'archives.'  Then  we'll  provide 
Jukeboxes  and  Image  Albums  (much  like 
what's  in  the  gutter  of  my  blog)  that  have 
built  into  them  these  huge  repositories. 

Basically  we're  making  sure  to  make  it 
REAL  easy  for  folks  to  utilize  media  in  their 
everyday  lives,  school  and  work. 


J.D.  says:


Open  Media  is  three  things  in  one: 

•  an  open-source  platform  to  bring 
personal  media  to  the  desktop; 

•  a  destination  Web  site,  to  launch  soon  at 
www.open-media.org; 

•  eventually,  it  will  evolve  into  a 
not-for-profit  organization  dedicated  to 
advancing  amateur,  hobbyist,  and 
semi-professional  visual  works  licensed 
under  a  Creative  Commons  license. 
Unlike  other  initiatives  that  are  pure-play
stand-alone  Web  sites,  Open  Media's  vision 
is  to  bring  personal  media  to  millions  of 
desktops  through  playlists,  video 
jukeboxes,  visual  albums,  and  built-in 
media  libraries. 

Brewster  Kahle  and  his  Internet  Archive  are 
supporting  this  project  with  free  storage 
and  bandwidth  for  grassroots  video  works. 

J.D.  also  has  a  nice  list  of  sites  now  archiving  homebrew
video.  (If  you  know  of  others,  I'm  sure  he'd  like  you  to  leave
a  link  in  the  comments  on  this  post.)

Okay,  it's  "an  open-source  platform  to  bring  personal  media 
to  the  desktop."  What  does  that  mean  to  you  and  me? 

I  asked  J.D.  some  questions  I  have  a  hunch  might  come  up 
around  some  dinner  tables.  Here  they  are,  with  his  replies: 

•  Can  I  put  my  home  movies  up  there 
and  send  the  link  to  my  family? 

Yep.  But  we're  not  yet  sure  if  there  will  be 
private  spaces  or  if  everything  "up  there" 
will  be  part  of  the  public  archive,  accessible 
to  all. 

•  I  want  to  create  an  archive  of  every 
Bush/Kerry/Cheney/Edwards 
campaign  speech,  enlisting 
volunteers  who  live  along  their 
routes.  Can  you  help  me  organize 
that? 

You  mean  the  text  of  the  speeches  instead 
of  the  video? 

No,  I'd  like  to  see  vblogging  by  people 
who  go  to  campaign  appearances,  or 
by  the  campaigns  themselves.  Primary 
sources.  Like  this  little  clip. 

Yeah,  this  is  the  sort  of  thing  that  could  be 
stored  on  the  open-media.org  site  we're 
building.  (What's  nice  for  users  is  there's 
no  bandwidth  costs.)  Then  all  you  have  to 
do  is  create  a  website  with  whatever 
branding  and  text  you'd  like,  and  have  a 
little  screen  for  video  footage,  point  it  to 
our  servers  to  deliver  the  video  stream  or 
download,  and  you're  all  set. 

•  Do  you  own  what  I  put  there,  as 
TextAmerica  does? 

No.  Anyone  who  uploads  anything  will  be 
required  to  fill  out  a  form  stating  which 
Creative  Commons  license  she  chooses, 
which  typically  means  the  work  may  be 
freely  shared  and  viewed  by  others  but 
must  be  attributed  to  the  content  creator. 
Or,  full  copyright  can  be  retained. 

•  Can  I  upload  video  from  my 
phonecam? 

Don't  know.  I'll  ask  around. 
•  Is  this  like  public-access  cable,
where  I  can  do  a  Wayne's  World  in  my 
basement  and  upload  it? 

If  you'd  like!  Hopefully,  people  will  create 
works  with  a  bit  more  meaning,  but  it  will 
run  the  gamut  from  deeply  felt  digital 
stories  to  satirical  bits  of  entertainment. 

•  I'm  in  Iraq.  How  can  I  get  my  war 
footage  to  you? 

Our  servers  should  be  accessible  to  anyone 
in  the  world  with  an  internet  connection, 
once  the  site  goes  live. 

•  Can  I  just  store  my  footage  there 
without  making  it  available  to 
anybody  else? 

Probably  not.  It's  not  for  storage,  it's  for 
sharing,  just  as  a  library  is. 

•  Am  I  gonna  be  able  to  search  for 
other  people's  cat  videos?  By  breed? 

Yes.  We'll  have  keyword  searches  and  a 
rich  metadata  library. 

•  Can  I  put  porn  videos  there? 

No.  We'll  have  terms  of  service  and  porn  is 
one  of  the  no-nos. 

•  When  does  the  Web  site  start? 

We're  shooting  for  sometime  next  month. 

•  I'd  like  some  recommendations  for 
tools  for  participation.  I've 
deliberately  held  back  till  I  could  get 
the  Swiss  army  knife
phone/web/wifi/video/email/moblog
appliance.

We'll  have  a  page  up  within  the  week, 
announcing  the  project. 

The  first  video  won't  go  up  until  next 
month  at  the  earliest  when  the  site  and  its 
functionality  are  built  out. 

We'll  be  able  to  make  recommendations 
for  tools,  too,  within  a  few  weeks.  Wheee!


A  glance  at  politics 

This  campaign  season  is  going  to  go  on  way  too  long.  Like  a 
radio  broadcast  of  a  tennis  match,  it's  already  come  down  to, 
"He  hit  it,  he  hit  it,  he  hit  it..." 

Two  things:  Complete  transcript  of  Stars  and  Stripes' 
interview  with  Sen.  John  Kerry.  I  hope  they  also  do  one  with 
George  W.  Bush. 

And  a  disturbing  quote  from  a  Chicago  Sun-Times  story  about 
Alan  Keyes  (Keyes  fires  up  GOP  faithful),  but  it's  more 
generic  than  that:

...In  fact,  Republican  Jim  Oberweis,  who 
flanked  Keyes  onstage  along  with  other 
GOP  candidates  who  lost  in  the  primary  to 
Ryan,  called  the  race  with  Obama  no  less 
than  "a  debate  between  good  on  the  right 
and  evil  on  the  left."... 

We  are  all  Americans.  There  is  an  Axis  of  Evil  and  it  does  not 
include  your  fellow  voters.  If  politicians  want  to  start  another 
civil  war  in  this  country,  keep  talking  this  way. 


Comic  relief:  Moore/Bush

Like  the  JibJab  video,  which 
lampoons  both  sides  equally  to 
the  tune  of  "This  Land  is  My 
Land...,"  Moore/Bush  casts  the 
filmmaker  and  the  president  in 
zany  lyrics  to  the  tune  of 
"She'll  Be  Comin'  Round  the 
Mountain." 
She  uses  the  planets  to  predict  the  sun

There's  a  nice  story  about  Rhode  Island  astrometeorologist 
Carolyn  Egan  (Weather  Sage  is  her  site)  in  The  Bristol 
Phoenix: 

...Carolyn  Egan  is  not  your  typical 
meteorologist.  It's  true,  she  does  predict 
the  weather  —  but  not  using  the  methods 
of  today's  meteorologists,  with  Doppler 
radars,  enhanced  satellite  imagery  and 
computer  models.  Ms.  Egan  is  an 
astrometeorologist.  She  uses  astronomy 
and  astrology  to  predict  weather  patterns 
across  the  globe.  And  she's  usually  right. 

"It  works  so  well  with  skill  and 
experience,"  she  said. 

The  desk  in  her  office  —  tucked  in  a  corner 
of  the  new  Bristol  in-law  apartment  she 
shares  with  her  husband  —  is  littered  with 
maps,  diagrams,  binders  and  a  computer 
screen  with  a  map  of  the  western 
hemisphere  displayed  predominantly  on  it.
Mrs.  Egan,  who  is  retired,  spends  about  40 
hours  a  week  on  weather  forecasting, 
mixing  her  hobby  —  she  does  make  some 
money  through  the  classes  and  workshops 
she  holds  —  with  helping  to  care  for  her 
grandchildren. 

...  Through  the  years,  Mrs.  Egan  has 
become  so  accurate  in  her  predictions,  she 
has  occasionally  been  hired  to  predict  the 
weather  for  special  events  —  like  weddings 
and  outdoor  parties  —  a  year  in  advance. 
She  hasn't  been  wrong  yet,  she  said. 

"I  achieve  80  to  90  percent  (of  my  weekly 
predictions)  now  and  100  percent  for  my 

Internet archivist has modest goal: Store everything

By  Matt  Marshall 
Mercury  News 

Brewster  Kahle,  founder  of  San  Francisco's  Internet  Archive,  burns  with  a  mission.  He  wants  to  ensure  universal  access 
to  all  human  knowledge.  And  now  he  thinks  that  goal  is  within  our  grasp. 

The  emergence  of  cheap  data  storage  technology  has  made  what  once  seemed  a  pipe  dream  distinctly  possible  — 
digitizing  and  storing  the  entire  Web,  the  world's  100  million  books,  2  or  3  million  audio  recordings  and  millions  more 
software  programs,  TV  shows  and  videos. 

"Storing this is a no-brainer," Kahle said.

He's making what has been digitized so far freely accessible at www.archive.org. And he's built an "Internet
Bookmobile," a van that drives around the country downloading public-domain books from the archive via a satellite
Net  link.  The  van  recently  got  kicked  out  of  Walden  Pond  in  Massachusetts  for  giving  away  Henry  David  Thoreau  books 
and  upsetting  local  book  merchants.  He  has  also  taken  the  Bookmobile  to  places  that  really  need  it  —  Uganda,  Egypt, 
India -- printing out books for children at $1 apiece.

And then there's the archive's newer offerings -- 15,000 music concerts and 300 feature films.

The  archive  recently  hired  an  engineer  to  design  an  affordable  Petabox,  which  stores  a  million  gigabytes  of  data.  The 
Petabox  will  be  ready  in  the  fall. 

Brewster's goal: Store everything. "It is possible," he proclaimed last week at a conference at IBM's Almaden Research
Center in San Jose. "It could be one of the greatest achievements of all time."

The Library of Congress houses about 28 million books, and he estimates he can scan and digitize each book for $10
apiece. That would cost about $280 million, or the equivalent of half the Library's annual budget.
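
The arithmetic behind that estimate is easy to check. A minimal Python sketch, using only the two figures quoted above; the names are illustrative, and the annual budget is merely implied by the article's "half" comparison rather than stated:

```python
# Figures quoted in the article.
BOOKS_IN_LIBRARY_OF_CONGRESS = 28_000_000
SCAN_COST_PER_BOOK_USD = 10


def total_scan_cost(books: int, cost_per_book: int) -> int:
    """Return the cost in dollars of scanning every book once."""
    return books * cost_per_book


total = total_scan_cost(BOOKS_IN_LIBRARY_OF_CONGRESS, SCAN_COST_PER_BOOK_USD)

# If the total is half the annual budget, the implied budget is twice the total.
# This is an inference from the article's wording, not a reported number.
implied_annual_budget = 2 * total

print(f"Estimated scanning cost: ${total:,}")
print(f"Implied annual budget:   ${implied_annual_budget:,}")
```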

The  Web  is  growing  at  about  20  terabytes  of  compressed  data  a  month,  which  is  manageable,  Kahle  said. 

Sure,  getting  copyrighted  material  has  its  challenges,  especially  music  and  videos.  But  he's  chipping  away  where  he 
can. 

Driving  Kahle  is  the  conviction  that  the  world's  information  is  a  common  good. 

In  that  spirit,  he  also  has  asked  Google  to  furnish  him  with  a  copy  of  its  database,  say  with  a  six-month  delay  so 
Google's  competitiveness  doesn't  suffer. 

Google  has  yet  to  grant  his  request.  But  Kahle  hopes  the  company  will  come  around,  especially  in  light  of  its  claim  that 
it  wants  to  have  a  positive  impact  on  the  world.  A  Google  spokeswoman  declined  to  comment. 

He  learned  a  tough  lesson  when  search  engine  Infoseek  initially  agreed  to  give  him  a  copy  of  its  database.  When 
Infoseek  went  bankrupt,  though,  the  lawyers  didn't  follow  through.  So  Kahle  is  adamant  that  Google  should  act  soon. 

"Let's give it to others," he says.
Contact  Matt  Marshall  at  mmarshall@mercurynews.com  or  (415)  477-2518. 

Internet Archive has copyright problems

DMCA  exempt  for  now 

By Nick Farrell: Wednesday 11 August 2004, 07:17

THE  DIGITAL  Millennium  Copyright  Act  is  proving  a  headache  for  those  hoping  to  preserve 
software  and  data. 

The  US  Internet  Archive,  which  makes  archival  copies  of  software  and  data,  said  it  was 
technically  impossible  to  do  its  job  because  of  the  Act  which  forbids  copying  software. 

Because  the  life  of  a  magnetic  disk  is  only  10  to  30  years,  the  Archive  would  have  to  copy 
the  stuff  every  few  years  to  preserve  it  which  would  be  illegal. 

This  week  the  group  announced  on  its  site  here  that  the  Copyright  Office  has  ordered  a 
temporary  exemption  for  the  group's  work. 

While this allows the Archive to carry on its sterling work, the decision is up for review in
2006. Hopefully by then the DMCA will have been extensively reviewed or repealed.

SAN JOSE, Calif.--The vast corpus of human
knowledge  could  soon  be  published  on  the  Internet. 
The  problem  now  is  how  to  wade  through  it. 

Although  search  engines  have  greatly  enhanced  access 
to  information,  and  storage  technology  has  made  it 
cheap  to  digitize  nearly  everything,  search  tools  need  to 
be  refined  to  make  it  easier  to  digest  information  or 
conduct  queries.  That  was  the  word  from  researchers 
and  speakers  at  the  New  Paradigms  for  Using 
Computers  Conference,  held  at  IBM's  Almaden 
research  lab  here  last  week. 

News.context

What's new:
Scientists are working on next-generation search engines and tools so users will be able to pick
through the data on their hard drives and the Web.

Bottom  line: 

The  amount  of  digital  information  is  exploding,  and  unless  inventions  bubble  up,  we  could  get 
lost  in  the  morass. 


"We  live  in  a  world  with  lots  of  information  but  also  lots  of  interruptions.  It  is  a  teriyaki  of  information.  The 
question  is,  'How  do  we  survive  in  the  marinade?'"  joked  Dan  Russell,  senior  manager  of  user  sciences 
and  experience  research  at  IBM  Almaden. 

Early  attempts  to  better  locate  the  world's  information  are  already  under  way.  The  University  of  California 
at  Berkeley,  for  example,  showed  off  at  the  conference  a  prototype  of  a  search  engine  called  Flamenco 
that  makes  it  easier  to  search  for  works  of  art  or  antiques.  Santa  Clara,  Calif.-based  Inxight,  meanwhile, 
has  created  software  that  attempts  to  graphically  represent  latent  connections  between  people  or 
institutions by studying where and how they get mentioned on the Web.

On  the  desktop,  companies  such  as  Ingenuity  Software,  founded  by  former  Apple  Computer  developer 
Bruce  Horn,  are  creating  tools  designed  to  make  it  easier  for  people  to  index  their  photos  and  documents 
for  subsequent  Google-like  searches  on  their  hard  drive. 

These  research  efforts  are  in  addition  to  new  operating  systems  under  development  that  will  include  better 
search  tools. 

Microsoft plans to add better search features to a future version of Windows, code-named Longhorn, due
sometime  around  2006  or  2007.  The  software  giant  last  week  demonstrated  a  more  general  Web  search 
"service"  that's  also  in  development. 

And  Apple's  Tiger,  a  new  version  of  the  company's  Mac  OS  X  operating  system  that's  due  next  year,  will 
include  a  new  systemwide  search  engine  called  Spotlight  that  will  allow  Mac  users  to  quickly  search  and 
find  any  file,  Apple  says. 

How  many  books? 

One  of  the  surprises  that  has  emerged  from  the  Internet  Archive,  which  is  intended  to  become  a  repository 
of  everything  ever  published,  is  that  the  body  of  public  works  can  probably  be  corralled,  said  Brewster 
Kahle,  founder  of  the  organization. 

About  100  million  different  books  have  been  published  in  history,  Kahle  said,  citing  estimates  from 
professor  Raj  Reddy  at  Carnegie  Mellon  University.  About  28  million  sit  in  the  Library  of  Congress.  On 
average,  a  book  can  be  condensed  to  a  megabyte  in  Microsoft  Word.  Thus,  the  books  in  the  Library  of 
Congress  could  fit  into  a  28-terabyte  storage  system. 
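
The storage estimate can be reproduced in a few lines. This is a rough sketch that assumes decimal units (1 megabyte = 10^6 bytes, 1 terabyte = 10^12 bytes) and takes the one-megabyte-per-book average at face value; the function name is illustrative:

```python
BYTES_PER_MEGABYTE = 10**6
BYTES_PER_TERABYTE = 10**12


def library_size_tb(book_count: int, mb_per_book: float = 1.0) -> float:
    """Return the storage needed for book_count books, in terabytes."""
    total_bytes = book_count * mb_per_book * BYTES_PER_MEGABYTE
    return total_bytes / BYTES_PER_TERABYTE


# The two book counts quoted in the article.
library_of_congress_tb = library_size_tb(28_000_000)
all_published_books_tb = library_size_tb(100_000_000)

print(f"Library of Congress: {library_of_congress_tb:.0f} TB")
print(f"All published books: {all_published_books_tb:.0f} TB")
```

Under the same assumptions, every book ever published would need roughly 100 terabytes.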

"For  the  cost  of  a  house,  you  could  have  the  Library  of  Congress,"  Reddy  said,  adding  that  mass 
book-scanning  projects  are  currently  underway  in  India  and  China. 

Only about 2 million to 3 million audio recordings -- mostly music -- have ever been published for public
consumption. The Internet Archive has begun to store digitized recordings of concerts as well and has about
15,000 shows in its database to date. There are between 100,000 and 200,000 theatrical movies -- half of them
from India -- in existence and about 20 terabytes of TV broadcasts a month. The Web grows by about 20 terabytes
of compressed data a month as well. (One terabyte equals 1 trillion bytes.) Since 1984, about 50,000 software
titles, including CD-ROMs, have emerged.

Though the legal issues around storing and viewing all this information remain thorny, storing it is doable.

"Universal access to all human knowledge is within our grasp," Kahle said.
"It could be one of the greatest achievements of all time."

Still,  that's  a  lot  to  grasp.  Similarly,  individuals  will  experience  an  explosion  in  their  personal  catalogs  of 
data.  In  the  MyLifeBits  project  under  way  at  Microsoft  Research,  noted  scientist  Gordon  Bell  is  attempting 
to  digitally  capture  all  of  the  books,  movies,  TV  shows,  music  and  other  media  he  has  experienced  in  his 
life.  He's  up  to  44GB  of  data  so  far. 

E-mails,  phone  messages,  photographs  and  personal  video  will  also  add  to  an  individual's  data  trove.  In 
another  experiment,  doctors  in  Cambridge,  England,  have  equipped  patients  suffering  from  severe 
memory  loss  with  a  Microsoft  SenseCam,  a  wearable  camera  that  takes  pictures  when  a  person  moves. 
One  man  is  currently  using  it  so  he  can  show  his  wife,  who  has  memory  problems,  a  diary  of  the  day,  said 
Ken  Wood,  who  works  on  the  project. 

Microsoft  has  also  entered  a  three-year  alliance  with  the  Edinburgh  International  Festival  in  Scotland.  In  a 
likely  experiment,  attendees  will  wander  about  the  arts  fest  with  SenseCams  around  their  necks,  snapping 
shots. 

Hide  and  seek 

One  approach  to  mastering  data  overload  lies  in  developing  search  engines  specialized  for  certain  topics 
and  data  sets.  That's  the  tack  taken  by  Berkeley's  Flamenco  project. 

In  Flamenco,  a  Yahoo-like  interface  categorizes  artworks  drawn  from  museum  collections  around  the  world 
by  content  (animals,  heaven  and  earth,  shapes  and  colors,  and  so  on),  century,  artist,  medium  (such  as 
painting, furniture, sculpture) and other identifiers. By going up and down the tree, users can browse
through  all  the  animal  pictures  found  in  the  database,  or  they  can  zero  in  on,  say,  the  years  1700  to  1709 
and  discover  that  the  period,  at  least  as  represented  by  the  database,  produced  only  four  paintings  of 
hoofed  mammals. 

The search engine does not search on the visual information contained in the picture, said Kevin Li, a
student  on  the  project.  Instead,  searches  are  conducted  on  descriptive  text  submitted  by  the  museums  that 
digitize  their  artwork  for  such  databases. 

Other  tools,  such  as  Inxight  and  GeoFusion,  produce  graphical  representations  of  data  obtained  through 
searches.  GeoFusion,  which  makes  software  that  can  extrapolate  from  geographic  data,  was  able  to 
render  a  map  of  the  movements  of  a  tagged  tuna. 

By  contrast,  Inxight's  software  creates  a  map  of  relationships  between  names  and  topics.  A  search  on  the 
White House and business showed that Halliburton is the corporation linked most often to the White
House.  In  a  similar  fashion,  IBM's  own  WebFountain  project  is  used  to  test  how  cohesive  certain  blogging 
communities  are  by  how  quickly  and  in  unison  they  react  to  news  events. 

File  systems  will  likely  begin  to  disappear  as  search  gains  popularity.  One  of  the  phenomena  that  Microsoft 
researchers  are  finding  in  MyLifeBits  is  that  files  are  largely  ad  hoc  categories  that  become  outdated,  said 
Jim  Gemmell  at  Microsoft  Research. 

Instead,  data  should  be  tagged  so  that  if  people  remember  a  name  or  part  of  a  name,  they  can  find  their 
way  back  to  documents  or  pictures  involving  that  person,  or  they  can  find  documents  created  on  the  same 
day  that  they  had  a  phone  conversation  with  the  person,  even  if  the  discussion  involved  something 
unrelated. 

"The  problem  is  not  that  we  keep  too  much  with  MyLifeBits.  The  problem  is  how  to  use  it,"  Gemmell  said. 

Poorer  nations  will  also  be  able  to  take  advantage  of  these  advances,  even  without  an  electrical  grid.  The 
Internet Archive has created bookmobiles in conjunction with Hewlett-Packard and others. The
bookmobiles  contain  a  printer  hooked  up  to  a  satellite  feed,  which  can  print  books  for  kids.  Two  are  in 
operation  in  India,  while  another  in  rural  Uganda  prints  about  1,500  books  a  week.  The  entire  bookmobile, 
including the cost of the used van, is $15,000, and 100-page books cost about $1 to print and bind in the
van. 

"It  takes  about  12  to  15  minutes  to  make  a  book,"  he  said.  "It  is  cheaper  for  a  library  in  the  United  States  to 
print  and  give  away  a  book  than  retrieve  it." 

Where Do Old Web Sites Go to Die?


When  URL  stands  for  U  R  Lost...  or  Left  Behind...  or  Languishing...  there's  still  hope. 

From:  Issue  85  August  2004,  Page  26 
By:  Fiona  Haley 
Illustrations  by:  Hal  Mayforth 

URL: http://www.fastcompany.com/magazine/85/mystery.html


You're happily trolling the Web, visiting your favorite sites. You type in www.meerkatfanatic.com, and a
message pops up: "404 -- Page not found." Or a cheesy ad from a hosting service. Huh? What gives? It
was  there  yesterday.  Where  did  it  go? 

In  a  world  where  the  average  life  span  of  a  Web  page  is  just  77  days,  sites  often  just  vanish.  Their 
owners either lose interest or stop paying to maintain them -- and the host providers unceremoniously
yank  them  off  the  Net.  It's  that  simple. 

Can  you  ever  find  them  again?  Sometimes  —  thanks  to  a  half-dozen  archive  services.  The  Internet 
Archive (www.archive.org), for one, saves 35 million sites every two months, totaling billions of
pages. Want to flash back to justballs.com, a dotcom purveyor of ... just balls that went belly up in
2002?  Plug  it  into  the  Archive's  Wayback  Machine,  and  presto!  Your  favorite  oldies,  resurrected.  It's  a 
matter  of  historic  interest,  says  the  Archive's  Michele  Kimpton:  "As  time  goes  on,  people  will  want  to 
save  and  access  the  Internet,  and  the  content  will  become  more  valuable." 


Copyright  ©  2004  Gruner  +  Jahr  USA  Publishing.  All  rights  reserved. 

How do you measure the popularity of items available for download or
sale on the Internet?

Researchers  from  Cornell  University  and  the  Internet  Archive  have 
devised  a  way  to  measure  users'  reactions  to  an  item  description:  a 
batting  average  of  the  number  of  users  who  go  on  to  download  the  item 
divided by the number of users who read the description. This mirrors
the  traditional  baseball  batting  average  of  the  ratio  of  a  player's  hits  to 
at  bats. 
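
As a minimal sketch of that ratio in Python (the function name and the sample counts are hypothetical, and returning zero for an item nobody has viewed is a choice made here, not something the researchers specify):

```python
def batting_average(downloads: int, description_views: int) -> float:
    """Fraction of users who read an item's description and then downloaded it."""
    if description_views == 0:
        # No one has read the description yet, so there is nothing to average.
        return 0.0
    return downloads / description_views


# Hypothetical counts: 100 users read the description, 24 went on to download.
example = batting_average(downloads=24, description_views=100)
print(f"Batting average: {example:.2f}")
```

A hit counter would report only the 24 or the 100; the batting average relates the two.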

The item description batting average is different from just tracking the
output of a hit counter, which measures the raw number of item visits
or downloads, said Jon Kleinberg, an associate professor of computer
science at Cornell University. "The batting average addresses the more
subtle notion of users' reactions to the item description as it appears in
the fraction of users who go on to download the item."

The batting average reveals something about the nature of on-line
popularity, can make users explicitly aware of shifts in popularity, and
allows administrators of large sites to quickly identify sudden and
potentially significant effects on the popularity of particular items and
prepare accordingly.

The researchers found that on the Web, popularity often changes
abruptly rather than gradually. "For example, an item would be getting
downloaded at a rate of roughly .18 percent, and then at exactly 8:35
am on February 20, it would drop to about 24 percent and stay there
for the next several days," said Kleinberg.

Although the abrupt shifts were initially surprising, "the underlying
reason is intuitive," said Kleinberg. "Your popularity on the Web is
affected by having a high-traffic site decide to link to you or mention
you in some way and this link or mention is added at a precise moment
in time," he said.

This draws a lot of traffic to the item's description, and the traffic is "a
new, larger mix of users with a possibly different set of interests than
the niche population that has been viewing it up until then," said
Kleinberg. This can either drive the batting average up abruptly if this
larger population decided that they really liked the item, or down if, by
and large, they did not, he said.

In working with data from the Internet Archive, which maintains a
digital collection of publicly available films, concerts and books, the
researchers found that abrupt shifts corresponded closely to real-world
events that drove what was often a new mix of users to view an item's
description.

Analyzing item popularity dynamics at a given Web site can help
characterize the impact of a range of events taking place both on and
off the site, according to Kleinberg. The batting average shows a
change in the make-up of the population, as reflected in the fraction
that was interested in downloading the item, he said.

A practical benefit of the batting average is making users aware of
popularity shifts, said Kleinberg. "For each item, we can imagine
keeping a running history of the on-site spotlighting and active external
links that have affected the item over the previous years and months,
together with a summary of the effect on the item's popularity," he said.

The same goes for reviews of items, said Kleinberg. "Since the
appearance of a strong positive or negative review can affect the
batting average, there's the intriguing possibility of creating a
quantitative measure of 'review impact.'"

The researchers tracked abrupt shifts in batting averages using an
algorithm based on Hidden Markov Models, a type of pattern
recognition algorithm that observes a sequence of states in order to
identify the system producing them and make predictions about future
states. Hidden Markov Models are widely used in speech recognition
software, where a spoken word is the system and the sounds that make up the
word -- phonemes -- are the states.


"In  this  case,  the  hidden  states  correspond  to  the  possible  values  of  the 
current  batting  average  for  the  item,  and  so  we  can  analyze  the 
sequence  of  item  downloads  to  estimate  the  most  likely  moments  at 
which this batting average changed," said Kleinberg.
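
The article does not give the researchers' actual algorithm, so the following is only a sketch of the general idea under stated assumptions: each description view is recorded as 1 (download) or 0 (no download), the hidden states are a small hypothetical set of candidate batting averages, the chance of staying at the same level between views is a made-up constant, and the standard Viterbi algorithm picks the most likely sequence of levels.

```python
import math


def most_likely_averages(events, levels, stay_prob=0.99):
    """Viterbi decoding of the hidden batting-average level behind each view.

    events: non-empty list of 0/1 flags, 1 meaning the view led to a download.
    levels: two or more candidate batting averages, each strictly between 0 and 1.
    stay_prob: assumed probability that the level is unchanged between views.
    """
    n = len(levels)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def log_emit(level, event):
        # Probability of seeing a download (or not) at this batting average.
        return math.log(level if event else 1.0 - level)

    # score[s] = best log-probability of any level sequence ending in state s.
    score = [math.log(1.0 / n) + log_emit(p, events[0]) for p in levels]
    back_pointers = []
    for event in events[1:]:
        pointers, new_score = [], []
        for s in range(n):
            best_prev = max(
                range(n),
                key=lambda r: score[r] + (log_stay if r == s else log_switch),
            )
            transition = log_stay if best_prev == s else log_switch
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transition + log_emit(levels[s], event))
        back_pointers.append(pointers)
        score = new_score

    # Trace the best path backwards from the most likely final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [levels[s] for s in path]


def change_points(path):
    """Indexes at which the decoded batting average differs from the previous one."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Synthetic data: 200 views at a 20 percent download rate, then 200 at 60 percent.
events = [1, 0, 0, 0, 0] * 40 + [1, 1, 0, 1, 0] * 40
path = most_likely_averages(events, levels=[0.2, 0.6])
shifts = change_points(path)
print("Estimated shift positions:", shifts)
```

A high stay probability makes the decoder ignore isolated downloads and report a shift only when the evidence for a new level accumulates, which matches the abrupt, lasting changes Kleinberg describes.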

The researchers are working on models that will be able to infer what a
user is doing and what a user is trying to accomplish when visiting a
site like Amazon, arxiv.org, or the Internet Archive. "The batting
average and its analysis through Hidden Markov Models is a simple
example of such a model, but richer models might allow us to guess
that one user is lost and not sure of what to purchase, while another is
in the process of seeking a specific item," said Kleinberg.

Applications based on the researchers' current method are possible in
the near-term; better models that can infer what a user is doing are
several years out, said Kleinberg.

Kleinberg's research colleagues were Jonathan Aizen of the Internet
Archive and Daniel Huttenlocher and Antal Novak of Cornell
University. The work appeared in the January 6, 2004 issue of the
Proceedings of the National Academy of Sciences. The research was
funded by the National Science Foundation (NSF) and the David and
Lucile Packard Foundation.

Musicians offering fans a chance for free music


There are two great nights of music at The Loft this weekend.

One  band  you've  maybe  heard  of  --  legendary  jammer  Col.  Bruce  Hampton  and  the  Codetalkers  plays  Saturday 
night. 

The other band, maybe not. That's Memphis' Stout, who will lay down their bluesy groove tonight.

This  isn't  about  those  bands.  Suffice  it  to  say,  both  are  money  well  spent  and  fans  of  the  Colonel  probably  also  would  dig 
those  upstarts  in  Stout. 

This  is  about  how  anyone  with  a  computer  and  modem  can  check  out  both  bands'  music  for  free.  Not  to  mention  music 
from  a  couple  hundred  other  bands  you  might've  heard  of. 

The Cowboy Junkies, Ryan Adams, Little Feat, Billy Bragg, the Dirty Dozen Brass Band, Guster, Fugazi,
Tenacious D, The Minutemen, the Ass Ponys, Ben Kweller and Hank Williams III, to name a few of the
better-known acts represented.

Archive.org is a Web site by free-thinking folks whose only goal is to share information. They make available for download
the  text  from  public-domain  books,  photos,  even  old  films.  And  there's  lots  and  lots  of  live  music  —  some  15,000  full 
concerts  by  bands  who  agree  to  let  their  stuff  go  online,  just  as  long  as  it's  free. 

These  are  bootleg  shows,  sometimes  taped  directly  from  a  soundboard,  sometimes  through  mikes  set  up  on  tripods,  and 
sometimes  through  those  little  hand-held  recorders  hidden  in  trench  coats  by  fans.  That  last  group  of  guys,  in  particular, 
should  receive  our  thanks,  since  they're  willing  to  venture  into  public  looking  like  child  molesters  for  the  sake  of 
documenting  great  bands. 

Stout's  on  there,  and  a  few  versions  of  Col.  Bruce's  bands  like  the  Codetalkers  and  the  Aquarium  Rescue  Unit.  You  can 
browse  an  alphabetical  list  of  artists,  or  search  for  your  favorites,  or  search  for  the  shows  with  the  highest  "batting 
average"  (which  means  the  highest  percentage  of  people  who  checked  out  the  set  list  and  decided  to  download  the  show). 

Oh,  and  Grateful  Dead  fans,  this  is  your  nirvana.  There  are  2,462  shows  available  by  Garcia  and  Co.,  to  date. 

Other Lofty acts are there, too, including Angie Aparo, Snake Oil Medicine Show and Cast Iron Filter.

A word of warning to dial-up users: Stick to the section dedicated to MP3 shows if you ever plan to hear a dial tone again.
Many of the shows are in file formats that better preserve sound quality. But that can make a 3-minute song more than 30
megabytes in size, which is about 10 times the size of the equivalent MP3 file.
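
To put that warning in numbers, here is a rough sketch of the download times involved. The 56-kilobit-per-second modem speed is an assumption (the column names no speed), the line is treated as running at full rate the whole time, and a megabyte is taken as 10^6 bytes:

```python
MODEM_BITS_PER_SECOND = 56_000  # assumed dial-up speed, not from the column


def download_minutes(size_megabytes: float,
                     bits_per_second: int = MODEM_BITS_PER_SECOND) -> float:
    """Return the ideal download time in minutes for a file of the given size."""
    total_bits = size_megabytes * 1_000_000 * 8
    return total_bits / bits_per_second / 60


lossless_minutes = download_minutes(30)  # the 30-megabyte song from the column
mp3_minutes = download_minutes(3)        # about one-tenth the size

print(f"30 MB song: about {lossless_minutes:.0f} minutes")
print(f"3 MB MP3:   about {mp3_minutes:.0f} minutes")
```

Real dial-up connections rarely reach their rated speed, so actual times would be longer.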

Get  clicking  at  www.archive.org/audio. 

In  fact,  set  up  some  stuff  to  download  while  you're  heading  out  to  a  live  show  of  your  own.  Billy  Gewin  opens  for  Stout 
tonight  at  10  p.m.  ($5).  Saturday,  Whiskey  Bent  opens  for  The  Codetalkers,  also  at  about  10  ($12).  Details  on  either: 
596-8141. 

Jo  Jo's  in  museum 

After  I  harangued  the  Country  Music  Hall  of  Fame  in  Nashville  pretty  hard  a  couple  weeks  back  for  mistakenly  identifying 
our town's Jo Jo Benson as being from Ohio, the museum's Michael Gray contacted me to convey his mea culpa.

Gray,  an  associate  editor  at  the  museum,  actually  was  excited  to  learn  Benson's  whereabouts,  since  he  wanted  to  let  the 
soul  singer  know  that  the  video  to  "Soul  Shake"  is  on  an  endless  video  loop  as  part  of  the  ongoing  "Night  Train  to 
Nashville"  exhibit  on  classic  soul  and  R&B. 

So that's another good reason to check out the exhibit. Visit www.countrymusichalloffame.com for details, or call (800)
852-6437. 

Eric Eldred of Derry, N.H., who uses his Internet Bookmobile to offer advice on downloading books in the public
domain, is seeking legal advice after a state park ranger at Walden Pond asked him to leave. (Globe Staff
Photo  /  Danne  Rathe) 

Fighting to be free
The Boston Globe

Thoreau  lover  denied  bid  to  give  out  book  at  Walden 

By  Kathleen  Burge,  Globe  Staff  |  July  19,  2004 

CONCORD  --  It  was  an  idea  that  Henry  David  Thoreau  could  have  loved:  Earlier  this  month,  with  the  150th 
anniversary celebration of Thoreau's influential book "Walden," Eric Eldred tried to hand out free copies on the
shores  of  the  pond  where  the  author  had  famously  retreated  from  civilization. 

Eldred,  the  driver  and  chief  evangelist  behind  his  Internet  Bookmobile  based  in  Derry,  N.H.,  will  soon  roam  the 
country  teaching  people  how  to  download  free  books.  What  better  way,  he  thought,  to  celebrate  "Walden,"  the 
ultimate  paean  to  self-reliance,  than  to  show  people  how  to  make  their  own  books? 

But  Eldred  had  barely  driven  his  white-and-red  camper  into  the  parking  lot  of  the  Walden  Pond  Reservation  and 
tacked up a handwritten sign reading "Free Walden" when a state park ranger asked him to leave. Eldred said he
was  told  he  needed  a  permit  to  hand  out  copies  of  the  book,  free  or  not,  and  would  be  arrested  if  he  continued. 

"Obviously,  Thoreau  didn't  ask  for  government  permission  before  he  published  'Walden,' "  said  Eldred,  sitting 
inside  his  bookmobile  last  week.  "It  seems  absurd  for  me  to  go  the  government  and  have  them  look  at  the  content 
and  see  whether  it's  approved  or  not.  ...  It  demeans  the  whole  spirit  of  Thoreau's  work." 

Eldred,  60,  has  been  seeking  legal  advice  on  whether  the  Massachusetts  Department  of  Conservation  and 
Recreation, which oversees the Walden Pond State Reservation, is violating his constitutional right to free speech
if  it  refuses  to  allow  him  to  distribute  the  books. 

Eldred  became  a  conscientious  objector  during  the  Vietnam  War  after  he  graduated  from  Harvard  in  1966  and 
has  long  found  inspiration  in  Thoreau's  writing.  He  decided  to  keep  fighting  for  Free  Walden. 

"I  asked  myself  what  Thoreau  would  have  done,"  Eldred  said. 

The  controversy  has  sparked  debate  around  town  and  on  Internet  chat  boards  across  the  country,  especially 
when  Eldred  said  park  officials  were  concerned  that  his  free  books  would  hurt  sales  at  the  Shop  at  Walden  Pond, 
which  sells  copies  of  "Walden"  in  English,  German,  and  Japanese,  as  well  as  black  T-shirts  with  one  of  the 
author's most well-known exhortations: "Simplify, simplify."

Denise  Morrissey,  the  park  supervisor  who  told  Eldred  he  had  to  leave,  said  her  agency  discourages  competition 
from  outsiders  who  could  take  away  business  from  the  two  concessions  that  pay  for  a  spot  on  the  reservation:  an 
ice  cream  truck  and  the  gift  shop  run  by  the  Thoreau  Society. 

"If  you're  going  to  give  away  books  for  free,"  she  said,  "it  might  take  away  business"  from  the  shop. 

Morrissey  sees  the  spat  as  nothing  more  than  a  bureaucratic  snag.  Everyone  who  wants  to  do  more  at  Walden 
than swim or walk the paths -- from filmmakers who want to make movies, to couples who wish to marry -- needs
state  permission,  she  said. 

Eldred  had  no  permit,  and  it  was  impossible  for  him  to  get  one  that  day,  Morrissey  said.  First  she  must  sign  off  on 
permits,  and  then  she  submits  them  for  approval  to  the  conservation  and  recreation  department. 

"We  try  to  be  very,  very  sensitive  about  what  kind  of  activities  go  on  here,"  she  said.  "Walden  has  a  unique  image 
that  needs  to  be  upheld." 

Morrissey  said  she  would  require  more  information  about  Eldred's  enterprise  before  she  could  decide  whether  to 
recommend  that  he  be  allowed  to  give  away  books.  She  said  she  was  startled  to  see  him  in  the  parking  lot  with 
his  Internet  Bookmobile. 

"He  had  a  generator,"  she  said.  "He  had  his  big,  mobile-home  kind  of  vehicle." 

Jayne  Gordon,  the  executive  director  of  the  Thoreau  Society,  said  she  appreciates  Eldred's  work  and  doesn't 
think  it  would  cut  deeply  into  the  profits  of  her  group's  book  sales.  But  she  relies  on  Morrissey's  judgment,  she 
said,  to  run  the  park. 

Jonathan  Zittrain,  codirector  of  the  Berkman  Center  for  Internet  &  Society  at  Harvard  Law  School,  has  advised 
Eldred  to  continue  seeking  a  permit  to  hand  out  "Walden"  at  the  pond.  Zittrain,  who  was  cocounsel  for  Eldred  in  a 
copyright  case  that  went  to  the  US  Supreme  Court,  is  optimistic  that  the  state  will  see  the  value  of  Eldred's  work. 

"Thoreau,  in  many  ways,  stood  for  the  integration  of  ideas  with  the  physical  environment,"  Zittrain  said.  "At  the 
core,  what  Eric  is  doing  is  showing  how  ideas  can  become  physical.  To  be  able  to  produce  'Walden'  at  Walden, 
it's  such  an  extraordinary  bridge  between  the  past  and  the  present." 

Eldred  has  proposed  a  compromise  at  Walden  Pond:  He  would  show  people  how  to  print  the  book  themselves, 
and  give  free  copies  only  to  those  who  first  bought  a  book  from  the  Thoreau  Society  store.  He  had  a  similar 
arrangement  with  a  Derry  bookstore  last  month,  where  he  parked  the  bookmobile  near  the  store  and  gave  away 
free  books  to  customers  who  showed  receipts  from  the  store. 

Eldred  was  the  lead  plaintiff  in  the  copyright  case  decided  last  year  by  the  Supreme  Court.  Eldred  and  his  lawyers 
had  asked  the  court  to  overturn  a  law  passed  by  Congress  in  1998  that  extended  the  term  of  copyright  protection 
for  an  additional  20  years.  Under  the  old  law,  copyright  protection  generally  expired  50  years  after  the  death  of  the
author.  The  court  found  the  new  law  constitutional. 

Eldred,  who  publishes  books  freely  available  in  the  public  domain  on  his  website,  www.eldritchpress.org,  is  a 
former  computer  programmer  who  suffers  from  repetitive  strain  injuries  and  lives  off  Social  Security  disability 
payments. 

Eldred  lives  in  the  bookmobile,  now  parked  in  the  driveway  of  the  Derry  house  where  he  once  lived  with  his 
ex-wife.  He  is  staying  in  New  England  until  October,  and  making  arrangements  to  visit  schools  before  he  heads 
south. 

"I  don't  want  to  be  really  isolated,"  Eldred  said.  "I  feel  there's  something  productive  that  I  can  do  to  help  society." 

Kathleen  Burge  can  be  reached  at  kburge@globe.com.

Borges's  universal  library:  someone  is  chasing  that  dream  on  the  Web.  And  with  30  billion
archived  pages,  the  project  is  close  to  succeeding

by  Marco  Deseriis

"When  it  was  proclaimed  that  the  Library  contained  all  books,  the  first  impression  was  one  of
extravagant  happiness.  All  men  felt  themselves  to  be  the  masters  of  an  intact  and  secret  treasure."
If  he  could  rewrite  "The  Library  of  Babel,"  his  extraordinary  1941  story,  today,  would  Jorge  Luis
Borges  still  dream  of  an  archive  of  knowledge  "that  coincides  with  the  universe  itself"?  Perhaps
not,  since  computers  and  the  Net  already  make  all  of  human  knowledge  potentially  archivable  and
accessible.  And  not  only  the  knowledge  born  of  the  Gutenberg  Galaxy,  but  also  music,  films,
television  programs  and  Web  sites.

The  man  making  this  case,  with  almost  missionary  fervor,  is  a  watchful-eyed  American  gentleman
named  Brewster  Kahle.  A  student  of  Marvin  Minsky  in  the  artificial  intelligence  program  at  MIT  in
the  early  1980s,  Kahle  made  headlines  a  couple  of  years  ago  with  the  "Internet  Bookmobile,"  a  van
equipped  with  satellite  access  to  the  Net  that  can  download,  print  and  bind,  in  a  few  minutes,  a
book  chosen  from  a  selection  of  20,000  titles.

Kahle  is  also  the  founding  father  of  the  "Internet  Archive,"  a  monumental  project  whose  aim  is  to
"make  all  human  knowledge  universally  accessible,  forever,  with  unlimited  bandwidth  available."
Apparent  megalomania  aside,  a  look  at  the  site  (www.archive.org)  shows  that  the  archive  already
holds  an  impressive  quantity  of  data:  more  than  10,000  concert  recordings,  20,000  books,  more  than
2,000  video  files  and  some  [figure  illegible]  million  archived  Web  sites.

The  project's  story  begins  in  the  mid-1990s,  when  Kahle  decides  to  sell  his  Web  publishing
company,  WAIS,  to  America  Online  and  invests  the  proceeds  in  Alexa,  a  database  and  search  engine
that  collects  and  archives  the  material  it  finds  on  the  Web.  In  partnership  with  Alexa  (now
controlled  by  Amazon),  Kahle  develops  the  Wayback  Machine,  a  database  that,  according  to  the  site,
currently  preserves  about  30  billion  Web  pages  online.  Using  it  is  simple  and  intuitive:  type  in  a
Web  address  and  a  date,  and  you  are  catapulted  to  the  Yahoo!  site  of  1996  or  the  New  York  Times
site  of  1997.  A  quick  browse  is  an  immediate  reminder  that  seven  or  eight  years  on  the  Internet
are  equivalent  to  20  years  of  TV  or  [figure  illegible]  of  radio.  "The  average  lifespan  of  a
document  on  the  Web  is  100  days,"  Kahle  notes.  "To  keep  pace  with  that,  every  two  weeks  our
software  captures  and  archives  snapshots  of  the  Web."

The  time  machine  is  an  automatic  device  that  grows  at  a  rate  of  thousands  of  billions  of  bytes  a
month,  fighting  the  rapid  obsolescence  of  addresses  and  content  that  marks  the  Net.  The  fight
against  human  forgetfulness  is  joined  by  the  fight  against  the  forgetfulness  of  machines.  If  the
problem  for  the  custodians  of  ancient  libraries  was  to  avoid  fires  and  copy  volumes,  the  modern
digital  librarian,  too,  must  make  sure  the  data  is  not  lost.  That  is  why  the  Internet  Archive  has
donated  a  copy  of  its  data  to  the  new  Bibliotheca  Alexandrina  (inaugurated  in  Alexandria  in
October  2002,  under  the  aegis  of  UNESCO)  and  has  formed  partnerships  with  providers  such  as
XS4ALL,  Internet2  and  SURFnet.  "Only  when  all  our  data  has  been  replicated  in  the  Netherlands,  in
Egypt,  in  India  and  in  China  will  we  be  able  to  sleep  soundly  and  avoid  catastrophes  like  the
historic  fire  at  the  library  of  Alexandria,"  Kahle  concludes.

How  free  the  BBC  is!

In  November,  the  Italian  site  Liber  Liber  and  its  Progetto  Manuzio  will  mark  ten  years  of  "free
books."  Hundreds  of  novels,  essays,  plays  and  historical  works  whose  copyright  has  expired  are
accessible  to  everyone.  Liber  Liber,  a  free-access  online  library,  also  collaborates  with  the
GNUtemberg  project  on  the  creation  and  use  of  electronic  books.
www.liberliber.it

In  September,  the  BBC  will  put  its  video  archive  on  the  Net,  at  the  disposal  of  its  subscribers.
Viewers  will  be  allowed  to  download,  free  of  charge,  material  from  one  of  the  largest  archives  in
the  world,  and  to  reuse  and  re-edit  it  on  a  non-profit  basis.
www.bbc.com

Preservation  sites

The  British  Library  is  part  of  a  consortium  which  aims  to  archive  6,000  websites.  So  why 
are  they  doing  it  and  what  will  be  the  selection  criteria,  asks  Guy  Clapperton 

Guy    Clapperton 
Monday   June    28,    2004 

The    Guardian 

In  the  50s,  television  became  a  popular  mass  medium,  20  years  or  so  after  its  invention.  That's 
common  knowledge.  Also  widely  known,  at  least  in  media  circles,  is  that  initially  TV  was  seen  as  a 
disposable  medium,  and  few  programmes  were  kept  from  its  earliest  days. 

Archivists  now  squint  at  snowstorm  screens  that  represent  the  only  moving  images  left  from  the 
30s,  fans  of  60s  science-fiction  shows  marvel  at  the  sole  surviving  episodes  of  their  favourite  sagas 
and  archive  compilation  shows  screen  grainy  clips  of  relatively  recent  programmes,  clawed  back 
from  overseas  sales  and  second-  or  third-generation  prints. 

It  is  agreed,  almost  universally,  that  had  people  foreseen  the  interest  that  would  be  shown  in  these 
older  programmes  later  on  (and  had  the  storage  technology  at  the  time  not  been  so  bulky  and  space 
consuming),  the  BBC  and  others  would  have  taken  more  care  in  preserving  them.  Given  this 
backdrop,  it  may  seem  odd  to  many  that  so  little  has  apparently  been  done  to  preserve  electronic 
copies  of  websites  and  virtually  held  documentation. 

The  problem  has  been  attracting  attention  for  a  while;  in  April,  the  Guardian  reported  on  some  of  the 
technical  issues  surrounding  storing  the  data  on  disks  that  become  obsolete  too  quickly  (The  history 
dustbin?  April  19). 

Now,  the  UK  Web  Archiving  Consortium  (comprising  six  organisations,  including  the  British 
Library),  aims  to  store  about  6,000  websites  for  posterity.  This  is  in  advance  of  any  legislation 
obliging  people  to  submit  copies  of  websites  and  other  electronic  copy  to  legal  deposit  libraries.  The 
initial  project  will  take  two  years,  after  which  the  consortium  will  review  its  progress  and 
objectives. 

Mark  Middleton,  the  British  Library's  web  archiving  programme  manager,  explains  that  the  scheme 
will  need  to  be  extremely  selective  in  the  first  instance.  "We  will  need  to  get  every  website  owner's 
permission  to  copy  their  site  into  the  library,"  he  says. 

This  means  a  lot  of  work  by  the  British  Library  and  the  other  consortium  members,  which  include 
the  National  Libraries  of  Wales  and  Scotland,  the  National  Archive,  the  Wellcome  Trust  and  the  Joint 
Information  Systems  Committee  of  the  higher  and  further  education  councils.  Each  will  use  its  own 
collection  policy  to  determine  exactly  what  it  believes  should  be  preserved,  and  then  will  plough 
internal  resources  into  gaining  the  necessary  permissions.  "We  will  be  identifying  sites  that  have 
political,  scientific,  social  or  artistic  interest  for  the  nation  and  for  future  generations,"  says 
Middleton.

The  National  Archive  has  been  retaining  copies  of  government  sites  for  some  time.  Adrian  Brown,  the
Archive's  services  manager  for  digital  media,  says  his  organisation  takes  a  snapshot  of  the  sites 
periodically,  the  frequency  depending  on  how  often  the  site  is  likely  to  change.  "Our  real  motivation 
for  this  is  the  1999  Modernising  Government  white  paper,  which  set  the  target  of  making  all
government  services  available  online  by  2005." 

The  Archive  decided  that  websites  were  becoming  an  increasingly  important  facet  of  government 
interaction  with  the  citizen,  and  there  was  a  need  to  collect  the  sites  involved  for  posterity.  Certainly 
the  difference  between  the  modern  sites  and  some  of  the  earlier  efforts  is  marked;  only  last  week 
Brown  was  looking  at  the  Home  Office  site  from  2000,  which  was  very  text-based  and  old-fashioned 
by  modern  standards;  the  change  in  style  over  a  period  as  short  as  four  years  has  been  remarkable.

The  National  Archive's  methods  have  evolved  and  will  continue  to  do  so.  "The  very  first  snapshot  we 
took  was  of  the  Number  10  website  just  before  the  2001  general  election,"  explains  Brown.  "Then 
we  set  up  a  project  to  start  taking  regular  snapshots  of  a  selected  group  of  government  websites." 

The  snapshots  were  contracted  out  to  a  US  organisation  called  the  Internet  Archive,  and  the  project 
continues  to  this  day. 

"We  identified  six  basic  functions  of  government,  things  like  defence,  foreign  policy  and  management 
of  national  finances,  and  we  picked  out  a  representative  group  of  websites  within  each  of  those  areas." 

Updates  happen  sometimes  every  week  and  sometimes  once  every  six  months.  Brown  hopes  that 
involvement  in  the  new  consortium  will  offer  the  National  Archive  even  more  flexibility  in  the  way 
it  stores  the  government's  electronic  material.  "Websites  are  actually  quite  complicated  to  collect 
and  preserve,"  he  says.  A  public  inquiry  website,  for  example,  will  change  almost  every  day  while
the  inquiry  is  happening  and  then  become  completely  static  once  the  inquiry  is  finished.  The  National
Archive  would  normally  take  a  snapshot  of  the  finished  product  to  preserve,  but  there  are  other  less 
clear  examples  -  sites  that  don't  actually  have  a  finished  product  but  which  evolve  and  change  over 
time. 

"With  the  outbreak  of  Sars,  for  instance,  we  decided  to  start  collecting  public  health  websites  just  to 
capture  how  it  was  being  talked  about  and  how  the  information  was  being  put  out  during  that  period." 

There  are  other  technical  challenges  to  negotiate.  Middleton  highlights  a  number  of  them:  "There  are 
already  some  fairly  complex  things  on  the  web;  pages  you  can't  get  to  without  a  password,  and  Flash 
technology,"  he  says.  Stepping  out  of  the  government  sites  and  into  the  commercial  world  can
increase  these  complexities.  A  sponsor  or  partner  company  might  add  something  to  someone's  site  - 
an  advert  or  a  sponsor  message  -  which  uses  material  stored  on  a  server  other  than  the  one  the 
reader  believes  they  are  looking  at.  Should  this  third  party  material  be  retained  as  part  of  an 
archived  page  or  not?  There  is  no  clear  answer  as  yet,  but  keeping  the  links  intact  and  any  active
software  running  would  be  more  complicated  than  taking  a  simple  screen  grab. 

Maintaining  web-crawling  technology  that  will  monitor  all  of  these  changes  could  become 
increasingly  important  in  the  future.  The  British  Library  is  a  legal  deposit  library;  in  other  words,
there  is  a  legal  obligation  to  send  it  a  copy  of  any  hard-copy  publication  a  company  or  individual
might  publish.  For  the  moment  this  doesn't  apply  to  electronic  documentation  but  Middleton  believes 
the  time  will  come  -  once  the  right  technology  is  in  place.  "The  timing  has  to  be  decided,"  he  says. 

This  then  throws  up  more  technical  and  logistical  questions.  The  need  to  get  permission  from  every
site  owner  to  store  a  copy  of  a  site  will  be  eliminated  when  submission  to  the  BL  becomes  obligatory,
but  that's  only  one  practical  issue.

The  BBC  tried  to  launch  what  it  called  a  Domesday  Project  in  the  80s  to  catalogue  British  life 
electronically,  but  the  scheme  ran  aground  when  the  disks  became  obsolete  and  unreadable.  There  is 
also  the  problem  of  how  to  keep  a  record  of  websites  that  change  extremely  frequently.  A  news 
website  like  that  of  the  Guardian  or  the  BBC  will  alter  as  more  stories  are  added  and  won't  be  the 
same  for  as  long  as  a  few  hours.  Should  someone  keep  every  view?  And  to  what  end?  These  are  among 
the  decisions  that  will  have  to  be  faced. 

If  all  goes  according  to  plan,  the  archive  will  be  made  publicly  available  in  early  2005.

2004  100  Top  Websites  You  Didn't  Know  You  Couldn't  Live  Without

April  20,  2004 

What  is  a  top  Web  site?  It's  a  site  you  rely  on — one  you  just  have  to  tell  your  friends,  family,  coworkers,  and 
neighbors  about.  It  is  surprisingly  useful,  funny,  informative,  addictive.  It  does  something  cool  you've  never 
seen  before. 

You  already  know  about  Expedia.com  and  Monster,  not  to  mention  PC  Magazine  Online.  They're  part  of  our
ever-shifting  canon  of  Top  100  Classic  Sites.  But  did  you  know  you  could  win  $10  million  by  building  a  working
spaceship?  Where  to  go  to  find  out  whether  the  presidential  candidates  have  their  facts  straight?  Or  how  to 
find  out  why  your  computer  is  suddenly  acting  so  funny?  Read  on:  These  sites  will  soon  have  you  wondering 
how  you  ever  did  without  them.

Vampire  Slayer  and  The  Family  Guy.

CNN.com 

www.cnn.com 

They  shamefully  overplayed  Howard  Dean's  "I  have  a  scream"  speech,  but  this  is  still  one  of  the  best  sources
for  news  online.

Click  on  CNN  to  Go  to  find  out  how  to  get  CNN  content  and  breaking  news  alerts  on  your  cell  phone.

E!  Online

www.eonline.com 

Your  favorite  shows,  movies,  gossip,  and  award-winning  (okay,  not  really)  Paris  Hilton  coverage. 

ESPN.com 

www.espn.com 

From  the  Schumacher  who  races  F1  to  the  one  who  rides  Xtreme  Bulls,  it's  all  the  sports  you  could  ever  want. 

Internet  Archive 

www.archive.org 

An  Internet  library  that  holds  more  than  300  terabytes  of  text,  as  well  as  audio,  video,  and  images  from  Web
sites.

It's  not  just  Web  sites.  The  archive  is  also  trying  to  add  vintage  software  apps  to  its  collection,  but  copyright
issues  have  slowed  its  progress.  Stay  tuned.


The  Internet  Movie  Database  (IMDb) 

www.imdb.com 

Your  first  stop  for  movie  trivia  and  information  online. 

MSN  Slate  Magazine 

http://slate.msn.com 

Plenty  of  biting  content,  ranging  from  politics  and  business  to  food  and  travel. 

You  can  get  Slate  content  in  e-book  form,  free.  Just  go  to  Output  Options  |  ebooks.  You  can  get  all  articles
published  in  the  past  seven  days,  or  choose  which  ones  you  want.

The  New  York  Times  On  the  Web 

www.nytimes.com

The  paper  of  record:  visit  for  news,  op-ed  pieces,  thoughtful  analysis,  and,  sadly,  internal  scandals.

The  65-cent  electronic  edition  downloads  to  your  desktop  or  notebook  and  looks  just  like  the  real  paper,  and
it's  searchable.

NPR 

www.npr.org

Terry  Gross,  Click  and  Clack,  and  This  American  Life:  some  of  radio's  best,  online. 

The  Onion 

www.theonion.com 

"Six  Dead  in  West  Point  Panty  Raid."  The  laughs  keep  coming,  thanks  to  "America's  Finest  News  Source." 

RollingStone.com 

www.rollingstone.com

Whom  is  Britney  Spears  dating  now?  Is  Norah  Jones's  new  album  worth  your  $18?
Find  out  here.

Salon.com 

www.salon.com

You  can  still  read  some  of  Salon.com's  incisive  content  free,  or  view  an  ad  to  get  a
premium  day  pass.

Note  that  the  $35  annual  Premium  subscription  also  gets  you  subscriptions  to  three
magazines:  Wired,  National  Geographic  Adventure,  and  U.S.  News  and  World  Report.

ScienceDaily

www.sciencedaily.com

Keep  abreast  of  the  latest  scientific  breakthroughs. 

Television  Without  Pity 

www.televisionwithoutpity.com

Missed  the  latest  Surreal  Life?  Recaps  of  the  best  in  trashy  TV,  with  plenty  of  snark.

The  recaps  are  funny  enough,  but  the  discussion  forums  are  even  funnier.

Wired  News 

www.wired.com

Still  the  Web's  best  source  for  great  insights  on  how  technology  affects  culture,  society,  and  business. 

CURRENT  EVENTS  AND  NEWS  YOU  CAN  USE

institutional  collection  in  the  world.  The  Computer  History
Museum  in  Silicon  Valley  is  starting  to  look  at  software  preservation
seriously  too.  But  it's  largely  going  after  software  languages
and  packages  (some  of  which  are  also  in  imminent  danger
of  being  destroyed)  unrelated  to  games.

Legal  Issues 

I've  been  working  with  the  Internet  Archive  (www.archive.org),
a  nonprofit  institution  that's  "building  a  digital  library  of
Internet  sites  and  other  cultural  artifacts  in  digital  form."  We
discovered  possible  archiving  issues  involving  the  Digital
Millennium  Copyright  Act  (DMCA),  which  may  have  made  it
impossible  to  legally  archive  early  computer  software  and
games,  even  for  accredited  institutions  wishing  to  store  limited
amounts  of  private,  non-circulating,  archival  images.  So  we
petitioned  the  U.S.  Copyright  Office  about  these  access  protection
issues,  and  it  ruled  in  October  2003  that
exemptions  should  be  added  to  the  anti-circumvention  clause  of
the  DMCA,  to  be  valid  until  the  next  Copyright  Office  rulemaking
in  2006  (www.copyright.gov/1201/docs/librarian_statement_01.html).
The  exemption  applies  to  "computer  programs  and
videogames  distributed  in  formats  that  have  become  obsolete
and  which  require  the  original  media  or  hardware  as  a  condition
of  access."


This  does  not  mean  titles  posted  as  abandonware  are  legal
to  copy  as  you  please,  but  it  does  arguably  mean  official
institutions  can  make  a  limited  number  of  private  archival  copies
of  classic  software,  provided  they  own  the  original  physical
copy,  and  their  copy  doesn't  violate  the  DMCA.  So  the
possibility  now  exists  for  good  archiving  to  happen,  and  we're  in
the  early  stages  of  starting  software  archiving  projects.  Bear  in
mind,  the  Internet  Archive  does  not  claim  to  be  the  sole
solution,  just  one  of  many  possible  contributors,  especially  now
that  the  DMCA  exemption  has  arguably  made  classic  game
archiving  legal  in  the  U.S.

This  situation  needs  a  critical  mass  of  developers  like  you
bringing  your  technical  knowledge  to  bear  on  the  complex
archival  problems.  Your  efforts  may  include  donating  old  retail
software  for  archiving  or  even  allowing  some  of  the  less
financially  important  titles  in  your  company's  back  catalog  to  become
freely  available  through  the  archives.  If  there's  playable  public
content  in  these  putative  software  archives,  alongside  good
metadata  and  information  on  the  private  content,  then  the  pieces  will
be  in  place  to  create  a  canonical  archive  of  games,  ensuring  the
titles  you  worked  on  won't  disappear.

SIMON  CARLESS  |  Simon  is  a  former  videogame  designer
(Eidos,  Atari)  who  now  edits  the  popular  tech  web  site  Slashdot
(www.slashdot.org).  He  can  be  contacted  at  scarless@gdmag.com.

Judge  indicates  he  will  limit
government  evidence 
against  Al-Hussayen 

04/22/2004 

By  BOB  FICK  /  Associated  Press

Prosecutors  pressed  a  federal  judge  Thursday  to  show  jurors  an
Internet  Web  page  they  say  encourages  suicide  bombings  and  was  built
by  University  of  Idaho  graduate  student  Sami  Omar  Al-Hussayen,
who  is  being  tried  for  aiding  terrorism.

But  defense  attorney  David  Nevin  argued  that  even  though 
Al-Hussayen  formatted  the  page  in  mid-2001  for  the  Islamic 
Assembly  of  North  America,  the  government  cannot  prove  he 
endorsed  its  content.  Nevin  claimed  introducing  the  Web  page 
would  taint  the  jury. 

"There  is  no  question  that  this  is  prejudicial,"  assistant  U.S. 
Attorney  David  Deitch  said.  "But  it's  prejudicial  because  it  tends  to 
show  the  guilt  of  the  defendant." 

U.S.  District  Judge  Edward  Lodge  said  he  was  inclined  to  permit 
the  Web  page  as  evidence,  but  said  he  wouldn't  rule  until  the  trial 
resumes  Monday. 

Deitch  claimed  it  makes  no  difference  whether  Al-Hussayen 
believed  in  the  content  of  material,  only  that  he  realized  it  would 
serve  to  help  finance  and  recruit  terrorists.  He  said  it  was  core  to 
their  case. 

However,  Nevin  said  if  simply  providing  access  to  objectionable 
material  makes  a  person  guilty,  then  Molly  Davis  —  the 
administrator  of  the  Internet  Archive  which  provided  the 
prosecution  with  copies  of  old  Web  pages  with  inflammatory 
content  —  should  have  been  arrested,  too. 

Lodge  earlier  Thursday  told  prosecutors  he  would  likely  limit 
the  e-mail  evidence  they  can  introduce  against  Al-Hussayen  in 
trying  to  prove  he  used  his  computer  skills  to  foster  terrorism. 

Lodge  withheld  ruling  on  any  specific  e-mails  sent  to 
Al-Hussayen  by  others  in  the  e-mail  group  devoted  to  Chechen 
Muslims  until  an  attempt  is  actually  made  to  offer  them  as 
evidence. 

Deitch,  who  argued  the  e-mails  should  be  allowed  to  show 
what  Al-Hussayen  knew,  has  said  the  value  of  the  e-mails  and  their 
content  should  be  left  to  the  inference  of  jurors.

"I'm  not  comfortable  with  that,"  Lodge  said.  "I'm  not  going  to
let  the  jury  speculate."

At  issue  are  thousands  of  e-mails  from  the  Web  site 
http://www.qoqaz.com  for  which  Al-Hussayen  was  one  of  the 
designated  moderators,  but  Nevin  has  said  his  client  barely
participated  in  the  site.

Whether  more  than  15,000  additional  e-mails  and  9,000 
telephone  calls  government  agents  intercepted  in  their 
investigation  of  Al-Hussayen  will  be  introduced  as  evidence 
remained  uncertain. 

Nevin  wants  the  government  barred  from  using  any  information 
gleaned  from  that  material  against  Al-Hussayen  because  the 
material  was  not  made  available  to  the  defense  until  five  days 
before  the  trial  started.  The  government  claims  the  defense 
frittered  away  opportunities  to  review  the  material. 

Lodge  has  not  ruled  on  that  matter  directly.  However,  in  an 
order  issued  earlier  this  week,  he  said  he  would  give  the  defense 
additional  leeway  to  add  to  the  evidence  and  witnesses  it  may  offer 
during  the  trial  after  it  goes  through  that  material. 

The  government  has  cited  highly  inflammatory  e-mails  in 
building  its  case  against  Al-Hussayen.  The  34-year-old  Saudi 
national  is  charged  with  using  the  Web  site  of  the  Islamic 
Assembly,  based  in  Michigan,  as  the  foundation  for  an  Internet 
network  that  promoted  terrorism. 

The  government  claims  that  Al-Hussayen,  as  a  moderator,  controlled
content  on  the  site  and  did  little  to  prohibit  objectionable  postings.


Nevin  conceded  that  "some  of  the  messages  are  inflammatory. 
These  are  explosive  e-mails." 

But  he  said  that  Al-Hussayen  posted  only  74  of  the  3,579 
messages  on  the  site  between  its  founding  in  January  2000  and  the 
last  check  on  Sept.  10,  2003,  by  Yahoo!,  which  organized  the  site. 

He  acted  as  moderator  just  17  times  —  all  but  one  of  those  in 
June  and  July  2000  —  and  essentially  dropped  his  interest  in  the 
site. 

One  of  the  most  inflammatory  messages  found  on
Al-Hussayen's  home  computer  came  from  an  unidentified  person
who  sought  targeting  information  on  U.S.  troops  in  Iraq.

Nevin  said  the  e-mail  was  posted  just  before  midnight  and  only 
five  hours  before  federal  agents  raided  Al-Hussayen's  home  on  the 
Moscow  campus  and  arrested  him  Feb.  26,  2003. 

There  is  no  evidence  that  Al-Hussayen  even  saw  that  e-mail 
and  none  that  he  read  any  of  the  other  3,500  that  were  posted  on 
the  site,  let  alone  endorsed  their  content,  Nevin  said. 


On  the  Net: 

U.S.  District  Court  for  Idaho:  http://www.id.uscourts.gov

Battered  by  junk  and  reeling  under  makeshift
fixes,  e-mail  is  ripe  for  reinvention. 
Here's  how  six  of  the  industry's  most 
provocative  thinkers  envision  a  brighter  day 


BE  SAVED? 


E-mail  is  the  victim  of  its  own  backward  economics.  Anyone  can  send  a  message  to  anyone  else  postage
due;  the  sender  pays  almost  nothing,  while  the  recipient  pays  in  time  and  money  to  download  and  read  the
message.  With  that  kind  of  incentive,  it's  surprising  that  only  60  to  80  percent  of  e-mail  traffic  is
unsolicited  ads.

Any  doubts  that  spam  is  the  biggest  problem  on  the  Net  were  erased  in  February,  when  Bill  Gates  turned
it  into  a  keynote  topic  at  RSA  Conference  2004.  As  usual,  rather  than  propose  a  new  idea,  Microsoft's
chief  software  architect  gave  legs  to  existing  schemes.  Gates'  first  proposal,  caller  ID  for  e-mail,
would  use  DNS  to  filter  messages  from  forged  addresses  (see  "Ending  E-mail  Forgery,"  page  52).  A  more
high-concept  Microsoft  research  project  called  Penny  Black  would  require  e-mail  users  to  attach
e-stamps  to  messages  before  sending  them  to  strangers  --  the  stamps  would  be  cryptographic  tokens
bought  not  with  cash,  but  with  10  seconds  of  CPU  time.  Clever,  but  hackers  are  already  cooking  up
ways  to  cheat  the  system  (see  "How  Spammers  Beat  the  System,"  page  46).
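
The  idea  of  paying  for  a  stamp  with  CPU  time  rather  than  cash  can  be  illustrated  with  a  small
proof-of-work  sketch.  This  is  not  Microsoft's  Penny  Black  design,  whose  details  the  article  does
not  give;  it  is  a  generic  hashcash-style  example  in  Python,  and  the  function  names,  the  stamp
layout  (recipient:message_id:nonce)  and  the  bits  difficulty  setting  are  all  assumptions  made  for
illustration.  The  sender  has  to  try  many  nonces  to  mint  a  stamp,  while  the  recipient  checks  it
with  a  single  hash.

```python
import hashlib
from itertools import count


def _digest_value(recipient: str, message_id: str, nonce: int) -> int:
    """Hash the (hypothetical) stamp fields and return the digest as an integer."""
    payload = f"{recipient}:{message_id}:{nonce}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest(), "big")


def mint_stamp(recipient: str, message_id: str, bits: int = 16) -> int:
    """Sender side: brute-force a nonce whose hash has `bits` leading zero bits.

    Each extra bit doubles the expected work, which is the "price" of the stamp.
    """
    target = 1 << (256 - bits)
    for nonce in count():
        if _digest_value(recipient, message_id, nonce) < target:
            return nonce


def verify_stamp(recipient: str, message_id: str, nonce: int, bits: int = 16) -> bool:
    """Recipient side: one hash is enough to confirm the work was done."""
    return _digest_value(recipient, message_id, nonce) < (1 << (256 - bits))


# Illustrative usage with made-up values; bits=12 keeps the demo fast.
stamp = mint_stamp("stranger@example.com", "msg-0001", bits=12)
print(verify_stamp("stranger@example.com", "msg-0001", stamp, bits=12))
```

Because  the  stamp  is  bound  to  one  recipient  and  one  message  in  this  sketch,  a  bulk  sender  would
have  to  repeat  the  search  for  every  message,  whereas  an  ordinary  correspondent  pays  the  cost  only
occasionally.  A  real  scheme  would  tune  the  difficulty  so  that  minting  takes  seconds,  not  milliseconds.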

Whenever  Gates  shows  up,  you  know  the  tipping  point  has  arrived.  Instead  of  tinkering  with  ever
more  complex  anti-spam  filters  and  gateways,  it's  time  to  rethink  the  way  e-mail  works  in  the
enterprise.  With  that  in  mind,  we  rounded  up  a  half  dozen  successful  software  entrepreneurs  --
plus  one  unrepentant  spammer  --  and  asked  them  how  they  would  change  the  system  to  remove
mass-marketers'  incentives  to  flood  your  workplace  with  ads.

Our  six  experts  gave  us  six  different  answers.  But  all  of  them  agreed  that  positive
identification,  rather  than  rejiggered  economics,  is  the  key  to  clearing  the  clutter  from  the
e-mail  channel  in  the  enterprise.  To  be  clear:  Privacy  and  anonymity  are  values  worth  preserving
on  the  Internet.  In  the  workplace,  though,  the  rules  are  different.  As  one  of  our  panelists  put
it,  no  one  should  be  prevented  from  posting  personal  opinions  anonymously,  but  you'd  have  to  be
crazy  to  do  business  with  someone  whose  identity  can't  be  verified.

Your  Future  in  Spam

Today's  spam  costs  are  bad  enough:  Roughly  $50  per  user  annually,  according  to  The  Radicati
Group.  Without  anti-spam  measures,  an  anticipated  doubling  of  corporate  spam  messages  will
result  in  a  fivefold  increase  in  cost  --  taking  into  account  hardware,  software,  maintenance,
administration,  migration,  downtime,  and  training.

The  Floodgates  Open  Wider ...

[Chart:  total  corporate  messages  versus  spam  messages,  2003,  2006,  2007;  the  cost  note  is  truncated  in  the  source]

From:  ERIC  ALLMAN
Subject:  REDESIGN  SMTP


Before  getting  too  blue-sky  on  e-mail, 
we  decided  to  take  a  look  under  the 
hood  at  the  current  system.  As  the 
author  of  Sendmail,  the  program  that's 
served  as  the  Net's  primary  mail  trans- 
fer agent  for  more  than  two  decades, 
Eric  Allman  has  definite  ideas  on  what 
he'd  do  differently  were  he  to  start  on 
the program today, rather than in 1981
when  he  coded  the  first  version  as  a 
student  at  the  University  of  California, 
Berkeley.  "The  thing  that  made  e-mail 
so  great  was  that  it  was  completely  out 
of  control,"  he  tells  InfoWorld.  "But 
everyone  was  working  toward  a  com- 
mon goal." 

If  he  could  start  over,  Allman  would 
retool  the  existing  protocols  with  the 
benefit  of  hindsight,  instead  of  throw- 
ing them  out  completely.  "The  first 
thing  I'd  say  is  we  had  not  anticipated 
the  security  needs,"  Allman  says. 
"Authentication  should  just  be  built  in." 

Rather  than  focus  on  DNS-based 
authentication,  Allman  would  choose  a 
cryptographic  solution.  "I  would  put 
something  into  SMTP  that  required 
authentication  before  proceeding,  just 
as  we  have  with  POP.  It's  a  bit  harder 
than  that  because  unlike  POP,  SMTP 
connections  may  not  have  any  prior 
relationship,  so  things  like  shared 
secrets  are  out  of  the  question." 
Allman's dream solution includes an
Internetwide standard domain-authentication
mechanism. "This would be
part  of  an  optional  standard  connection 
initiation  protocol,"  he  says,  "so  we 
wouldn't  have  to  reinvent  authentica- 
tion for  each  and  every  use." 

Over  the  past  two  decades,  Allman's 
views on privacy haven't changed. He
still believes it's a necessity, but he's
developed  a  more  sophisticated  view  of 
how  to  implement  it.  "I  used  to  feel 
anonymity  in  the  base  protocol  was 
important,"  he  says.  "But  if  someone 
brought  up  an  anonymity  server  that 
would  do  re-mailings  for  you,  that 
would  allow  this.  The  trick,  of  course,  is 


He Told You So: Jon Postel (1943 - 1998)

ONE OF THE INTERNET'S FIRST ARCHITECTS,
Jon Postel, helped launch the first ARPAnet connection
in 1969 as a Ph.D. student at UCLA. At
the  time,  ARPAnet  was  restricted  to  research 
sites  and  funded  under  the  federal  Advanced 
Research  Projects  Agency. 

In  1975,  Postel  published  a  paper  for 
ARPAnet's  ad-hoc  Network  Working  Group. 
Numbered  RFC  706,  Postel's  essay  was  more 
memorably titled "On the Junk Mail Problem." In
it,  he  identified  an  Achilles  heel  in  the  architec- 
ture that  would  eventually  be  used  to  build  the 
Internet: 

"In  the  ARPA  Network  Host/IMP  interface 
protocol  there  is  no  mechanism  for  the  Host  to 
selectively  refuse  messages.  This  means  that  a 
Host  which  desires  to  receive  some  particular 
messages  must  read  all  messages  addressed  to  it.  Such  a 
Host  could  be  sent  many  messages  by  a  malfunctioning 
Host.  This  would  constitute  a  denial  of  service  to  the  nor- 
mal users  of  this  Host.  Both  the  local  users  and  the  net- 
work communication  could  suffer.  The  services  denied  are 
the  processor  time  consumed  in  examining  the  undesired 
messages  and  rejecting  them,  and  the  loss  of  network 
thruput  or  increased  delay  due  to  the  unnecessary  busyness 
of  the  network." 

Postel's  proposed  solution  has,  over  the  years,  proved 
easier  said  than  done.  "It  would  be  useful  for  a  Host  to  be 
able  to  decline  messages  from  sources  it  believes  are  mis- 
behaving or  are  simply  annoying." 

Of  course,  it  probably  never  occurred  to  Postel  —  or  to 
anyone  else  in  1975  —  that  e-mail's  vulnerabilities  would 
be  exploited  not  by  malfunctioning  computers  but  by  direct- 
marketing  entrepreneurs. 


When the U.S. government opened up the Internet to commercial
use in 1995, a year after the law firm Canter & Siegel
launched the first big spam campaign on Usenet, Postel's
"junk mail problem" took on a whole new meaning. As did
everything else on the Net, leading to lawsuits against Postel by
businesses dissatisfied with his role in assigning Internet
domain names and address numbers. In the '70s, designing for
more than four billion Internet addresses had seemed ample
overkill.

Six  years  after  Postel's  death,  the  fight  over  who  controls 
the  Internet  continues  to  grow  almost  as  rapidly  as  the  vol- 
ume of  spam.  Last  fall,  the  United  Nations  attempted  to  seize 
control  of  the  Internet  Assigned  Numbers  Authority,  which  had 
once consisted mostly of Jon Postel. RFC 706 is now a museum
piece,  a  nostalgic  artifact  of  an  earlier,  noncommercial  Net  on 
which  most  problems  were,  well,  academic. 
—  P.B. 
to  avoid  abuse  —  this  could  perhaps  be 
done  by  having  explicitly  tagged 
addresses  that  are  willing  to  receive 
anonymous  mail.  Whistle-blower 
addresses,  investigative  reporters,  and 
so  on  might  be  willing  to  receive  arbi- 
trary anonymous  messages,"  using 
servers  that  don't  keep  any  logs  that 
could  be  subpoenaed. 

Allman  thinks  that  problems  with 
e-mail today extend beyond unsolicited
ads.  "There  are  lots  of  definitions  of 
garbage,"  he  says.  "Spam  is  just  the 
worst  one.  I  know  several  people  who've 
just  given  up  on  e-mail.  They've  gone 
back  to  having  'their  person'  do  it.  It's 
not  just  spam,  it's  also  the  continuous, 
'Gee,  can  you  help  me  on  this?'  No  mat- 
ter how  big  a  shovel  you  have,  you  can't 
get  rid  of  it." 


From: BILL WARNER
Subject: IDENTIFY YOURSELF

"Saying  I  like  challenge-response  sys- 
tems is  like  saying  I  like  duct  tape,"  says 
Bill  Warner,  whose  frustration  with 
endless  rounds  of  phone  tag  led  to  his 
development  of  the  Wildfire  voice  sys- 
tem in  the  1990s.  Warner  runs  his  own 
challenge-response  server  to  kill  incom- 
ing spam  but  would  rather  see  the  sys- 
tem redesigned  more  along  the  lines  of 
the  U.S.  Postal  Service  —  not  meaning 
the  government  would  run  it,  but  that 
there  would  be  some  people-centric 
checks  on  identity  and  abuse. 


"It  comes  back  to  authentication," 
Warner  says.  "If  you  want  to  put  a 
server  on  the  system  and  use  DNS, 
you've  got  to  find  your  way  into  DNS 
somehow.  We've  managed  to  build  a 
network  of  millions  of  servers  around 
the  world  with  a  fairly  open  and  clear 
process  of  registering  for  it.  Why  can't 
we  do  that  with  e-mail?" 

Warner  isn't  talking  about  validat- 
ing sender  IP  addresses,  but  instead 
having  some  idea  of  who's  behind 
them.  "Part  of  the  problem  is  e-mail 
creates a large scale of anonymity. The
postal service doesn't have that problem.
You can send e-mail through the
postal  service,  and  it  doesn't  get  more 
than  a  postmark.  But  you  don't  get  to 
drop  a  million  messages  in  the  system. 
If  you're  a  big  mailer,  you're  going  to 
be  known.  If  you  deliver  a  million 
pieces  of  mail  to  the  post  office,  they're 
going  to  know  who's  doing  it,"  and 
they're  legally  obligated  to  deliver 
them  all. 

In  short,  Warner  thinks  that  instead 
of  focusing  on  caller  ID  schemes  that 
identify  servers,  we  should  reach  past 


How  Spammers  Beat  the  System 

THE HOTTEST TOPICS IN SPAM-FIGHTING TODAY ARE COMPUTATIONAL
solutions. These methods require e-mail senders to burn CPU time on their own
computers to create e-stamps they must attach individually to each message sent
to strangers. The Penny Black project at Microsoft Research is the best known.
Separately,  programmer  Adam  Back  maintains  a  thorough  FAQ  on  the  topic  at 
hashcash.org. 

Computational  schemes  work  like  this:  If  your  computer  sends  to  my  server  a 
message  addressed  to  me  and  you're  not  on  my  list  of  trusted  correspondents,  my 
server sends a math problem back to your mailer, which must provide the answer. That
problem  is  a  function  that  uses  my  address,  yours,  and  possibly  information  about 
your  specific  message  as  input  parameters.  Your  computer  must  calculate  the 
function  separately  for  every  single  recipient  to  whom  you  try  to  send  a  message.  The 
function  is  designed  to  take  a  finite  amount  of  time  —  say,  10  seconds  —  to  calcu- 
late, regardless  of  your  CPU  speed  or  other  hardware. 
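
The stamp idea described above can be illustrated with a small hash-search sketch. This is only an assumption-laden illustration, not the Penny Black or Hashcash format: the function names, the use of SHA-256, and the `sender:recipient:nonce` layout are all hypothetical. Unlike the function the article describes, a plain hash search finishes sooner on faster hardware.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the front of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def stamp_digest(sender: str, recipient: str, nonce: int) -> bytes:
    """Hash the inputs that tie a stamp to one sender/recipient pair."""
    data = f"{sender}:{recipient}:{nonce}".encode("utf-8")
    return hashlib.sha256(data).digest()


def mint_stamp(sender: str, recipient: str, difficulty_bits: int = 16) -> int:
    """Sender's side: search for a nonce that meets the difficulty target."""
    for nonce in count():
        digest = stamp_digest(sender, recipient, nonce)
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce


def verify_stamp(sender: str, recipient: str, nonce: int,
                 difficulty_bits: int = 16) -> bool:
    """Recipient's side: a single hash is enough to check the work."""
    digest = stamp_digest(sender, recipient, nonce)
    return leading_zero_bits(digest) >= difficulty_bits
```

The asymmetry is the point of the design: minting takes many hash attempts per recipient, while checking takes one. Raising `difficulty_bits` by one roughly doubles the sender's expected work.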

The premise behind the plan is that if you want to send me a personal message,
you won't mind a 10-second delay. Your e-mail client might even begin the
negotiation and calculation process while you're composing the message. But if
you're a spammer trying to send 10,000 messages or more per minute before your
account is identified and shut down, you'll need a rack of servers to do the
computations sent back to you by recipients of your message. Hopefully, the cost
would make spamming unprofitable.

But  just  as  spammers  hacked  their  way  around  blacklists  and  Bayesian  filters, 
they're  more  likely  to  find  work-arounds  to  Penny  Black  than  go  out  of  business. 
Hackers  on  several  mailing  lists  have  already  identified  one  big  hole  in  the  plan: 
Many  spammers  already  send  their  messages  using  unsuspecting  people's  PCs, 
which they've hijacked by exploiting security holes. What's to stop them from breaking
into other computers to generate their stamps?
—  P.B. 
Three Steps to a New E-mail


the  computer  to  identify  the  person 
sending  the  message.  "In  a  society 
founded  on  openness  and  transparency, 
one  of  the  fundamental  tenets  is  that 
people  can  be  identified.  A  person  is 
allowed  to  go  out  in  public  wearing  a 
mask.  But  no  one  will  give  them  a  job, 
and  no  one's  going  to  buy  anything 
from  them  in  a  store.  You're  not  going 
to  let  them  through  the  front  door  of 
your  business."  Same  with  e-mail.  "You 
still  have  ways  to  be  anonymous.  But 
someone  who  wants  to  get  in  the  door 
and do business with you will have to
take  the  mask  off." 


From: ERIC HAHN
Subject: XML FOR E-MAIL

You  may  remember  Eric  Hahn  as 
Netscape's  CTO  or  as  a  member  of 
Red  Hat's  board  of  directors.  Today, 
Hahn  is  chairman  of  his  own  startup, 
Proofpoint, which sells spam filtering
solutions.
Hahn  thinks  Proofpoint's  products 
are  just  the  first  instantiation  of  a 
much  larger  transition,  in  which 
e-mail  becomes  XML-encapsulated 
metadata.

1. Identify servers. Use DomainKeys
to verify the source of incoming
messages. If that's too much of a
challenge to roll out, use DNS-based
methods such as the emerging RMX
protocol. Allow mail from validated
servers through. Give a lower priority to
those that are vague.

"Corporate mail processing isn't
about just spam and viruses," Hahn
says. "Most companies have a long list
of things they want to see true about
their mail. A corporation is going to
need to do n things to each e-mail
message, where n is greater than two.
How are you going to do the next
eight things?" Hahn says those eight
things might include:

■  acceptable  use  policies 

■  regulatory  constraints  on  what  can  be 
e-mailed  inside  and  outside  the 
company 

■ support for potential litigation, either
as plaintiff or defendant

■ intellectual property concerns

■ line-of-business systems integration
issues, such as employees who reply to
customers outside of the company's
CRM system.

"Today,  e-mail  payloads  are 
essentially opaque," Hahn says. But
unlike personal e-mail sent to and
from home or the road, e-mail sent on
company time, at least in the
United States, legally belongs to the
company. "The next generation of corporate
messaging architecture will
presume applying e-mail applications
to every message that goes by, just as
we now have Web applications." It
isn't just spam, it's the Sarbanes-Oxley
Act of 2002, which steeply raises the
bar on corporate self-auditing, that
Hahn says requires making e-mail
content automatically parsable.

2. Identify users. Every major e-mail
client now supports personal digital
signatures. Your sales and support
staff shouldn't demand customers use
them, but unsigned messages to other
employees could be returned for authentication.
It's no ruder than, "May I ask
who's calling?"

Does  this  mean  your  company  will 
be  reading  my  e-mail?  "Not  at  all," 
Hahn  says.  "When  we're  trading 
patient  records,  or  talking  about  a 
stock  trade,  we  shouldn't  have  to 
search  the  content.  We  should  be  able 
to  annotate  it,"  using  expandable, 
XML-driven  solutions  such  as  DRML 
(Data-entry  and  Report  Markup  Lan- 
guage). "We  need  to  have  an  ingrained 
metadata  structure  beyond  these  silly 
X-headers." 
3. Move crucial communications
out of e-mail. Use RSS for one-way,
recurring messages such as
companywide  announcements,  and  set 
up  distributed  folders  or  work  spaces  for 
widely  shared  documents. 


From: RAY OZZIE
Subject: SHIFT YOUR PARADIGM

Creator  of  Lotus  Notes,  the  groupware 
used  by  100  million  people,  Ray  Ozzie 
has  spent  years  studying  how  people  use 
their inboxes. His current company,
Groove  Networks,  produces  software 
that  allows  people  inside  and  outside  an 
organization  to  share  workspaces  and 
files over a secure, peer-to-peer connection.
But Ozzie is aware that Groove's
biggest competitor is e-mail. "For most
users  of  the  Internet,"  he  says,  "e-mail  is 
the  preferred  means  of  swapping  infor- 
mation —  whether  text  or  files  — 
because  it's  easy  to  use  and  it  usually 
works,  even  across  firewalls." 

Yet  Ozzie  feels  e-mail  has  been 
pushed  to  the  breaking  point,  past  the 
limits  of  its  original,  intended  purpose. 
"At  a  time  when  we  are  needing  new 
methods to cope with information overload,
the e-mail paradigm is showing its
30-year-old age," he says, resulting in
lower  and  lower  productivity  gains. 
"Not  only  are  there  the  obvious  issues  of 
spam  and  viruses;  it's  now  quite  com- 
mon that  large  files  and  common  file 
types  such  as  .doc  are  not  allowed  to 
pass  through  firewalls  because  of 
aggressive  IT  bandwidth,  storage,  and 
e-mail-filtering  policies." 

Ozzie  doesn't  claim  Groove  is  the 
solution  for  all  these  issues.  Rather,  it's 
one  part  of  a  strategy  to  move  work- 
place activities  out  of,  rather  than  into, 
e-mail.  "Rather  than  trying  to  cram  all 
sorts  of  new  things  into  e-mail,  we 
should  listen  to  what's  actually  happen- 
ing at  the  leading  edge  of  the  market: 
Instant  messaging  is  a  tremendously 
useful paradigm that takes interpersonal
communications in a new direction.
Skype [which lets PC users make
phone calls to each other over the Net]
sits  next  to  e-mail  quite  nicely,  thank 
you.  RSS  readers  and  aggregators  are 
showing  us  that  there  are  better  ways  to 
do  notifications  and  publish/subscribe 
than  filling  our  inbox."  Groove,  for  its 
part,  provides  a  security-wrapped 
workspace  for  collaboration  and  shared 
documents,  rather  than  keeping  them 
in  e-mail  folders. 

In  short,  Ozzie  has  no  interest  in 
re-inventing  e-mail.  "The  question,"  he 
says, "is: 'What new and more appropriate
paradigms will emerge to reflect
the fact that, in this world of ubiquitous
computing and communications,
the nature of work is fundamentally
changing?'"


From: DAVE WINER
Subject: RSS TO THE RESCUE

As one of the Net's top bloggers and a
leading  contributor  to  the  RSS  standard 
for  online  content  syndication,  Dave 
Winer,  chairman  and  founder  of  User- 
land,  recently  reinvented  himself  —  as 
a  Harvard  fellow  at  the  law  school's 
Berkman Center for Internet and Society.
When it comes to rethinking e-mail,
Winer's  goal  is  the  same  as  Ray  Ozzie's 
but  from  the  opposite  direction. 

"You  have  to  go  up  a  few  levels," 
Winer  says.  "There  are  two  sides  to  it: 
reading  and  writing.  At  the  core,  RSS  is 
about  publishing.  It's  philosophically 
opposite  to  what  Ray  Ozzie  does.  Ray  is 
about  privacy  [for  shared  files  and  work 
spaces].  The  root  word  of  publish  is 
'public.'"

But  Winer  agrees  on  the  endgame: 
"E-mail  is  over  as  a  publishing  medi- 
um. You're  better  off  publishing  a 
Weblog  with  RSS  feeds  people  can 
subscribe  to."  For  one-way  informa- 
tion flows,  the  protocol  enables  com- 
panies to  set  up  archived,  searchable 
feeds  rather  than  leaving  it  to  employ- 
ees to  fish  old  messages  out  of  the 
inbox.  "You  can  subscribe  to  things 
created  by  other  workgroups  or  to  the 
person  who  sends  around  e-mails  with 
links to articles, saying 'you gotta read
this.'  What  another  division  is  doing, 
what  your  competition  is  doing  — 
these  are  all  information  flows  in  a 
company  that  you  can  make  into 


feeds,  rather  than  mass  e-mailings." 
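
As a rough illustration of turning a one-way information flow into a feed, the sketch below builds a minimal RSS 2.0 document with Python's standard library. The function name `build_feed`, the item dictionary layout, and the intranet URLs are assumptions made for the example, not anything Winer or Userland prescribes.

```python
import xml.etree.ElementTree as ET


def build_feed(title, link, description, items):
    """Return an RSS 2.0 document (as a string) for a list of announcements.

    `items` is a list of dicts with "title", "link", and "description" keys;
    that layout is a convention of this sketch only.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires these three channel-level elements.
    for tag, text in (("title", title), ("link", link),
                      ("description", description)):
        ET.SubElement(channel, tag).text = text
    for item in items:
        node = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(node, tag).text = item[tag]
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_feed(
    "Company Announcements",
    "http://intranet.example/announcements",
    "One-way notices that would otherwise go out as mass e-mail",
    [{
        "title": "Office closed Friday",
        "link": "http://intranet.example/announcements/1",
        "description": "The main office is closed this Friday.",
    }],
)
```

Readers pull the feed when they choose to, which is what keeps the channel opt-in at both ends: nothing arrives unless someone has subscribed.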

Having  seen  his  own  inbox  get  out  of 
hand, one of Winer's design goals was
to  keep  RSS  unspammable.  He  did 
that,  he  says,  by  making  sure  the  sys- 
tem stayed  opt-in  at  both  ends.  "Once 


someone  sends  you  something  you 
don't  want,  you  can  vote  them  out  with 
your  cursor."  If  you've  ever  tried  to 
unsubscribe  from  a  mailing  list  that 
just  keeps  coming,  you  know  the  prob- 
lem. "There's  one  RSS  publication  I 


Interview  With 


Direct  Marketer 
YOU ALMOST CERTAINLY GOT THE
offer in your inbox a year ago this week:
A  copy  of  the  "Iraq's  Most  Wanted" 
playing  cards  created  by  the  Pentagon. 
Yours  for  only  $5.95. 

One  of  the  entrepreneurs  hawking 
the cards was OptInRealBig.com
founder  Scott  Richter,  who  sold  40,000 
decks  the  first  week  —  before  they 
were  even  printed. 

Richter  is  a  rarity:  a  spammer  willing 
to  show  his  face  and  give  interviews. 
Last  year,  Details  magazine  placed  him 
on  a  list  of  "Ten  Most  Influential  and 
Powerful  Men  Under  38,"  along  with 
Ben  Affleck  and  Eminem. 

Richter  is  quick  to  differentiate  him- 
self from  the  spammer  hordes,  however:  "What  we  do  is 
opt-in  e-mail  marketing,"  he  says  via  speakerphone  from 
his  Denver-area  office.  "We're  not  the  underground  spam- 
mer who  sends  out  the  Viagra  ad  spelled  incorrectly." 

Richter  is  currently  being  sued  by  the  State  of  New  York's 
Attorney  General  for  sending  advertisements  to  Hotmail 
members  that  violated  federal  deceptive  marketing  laws, 
including  misleading  subject  lines  and  faked  sender's 
addresses.  Richter  claims  a  subcontractor  sent  the  mes- 
sages. "Everything  we  send  out  has  our  address  on  it.  It  all 
comes  from  our  domain  space,"  he  says.  Either  way, 
Richter's  got  quite  an  operation  going.  The  New  York  suit 
claims OptInRealBig.com is the third most prolific ad
mailer  on  the  Net.  Richter  doesn't  deny  that  the  company 
sends  50  to  250  million  messages  per  day. 

Charging  roughly  $200  per  million  messages,  the  com- 
pany pulls  in  two  million  dollars  a  month  in  revenue 
advertising  everything  from  Iraq's  Most  Wanted  to  the 
inevitable  male  supplements.  "As  long  as  it  doesn't  bring 


any  lawsuit,  I  don't  care,"  Richter  says.  "I'm 
not  going  to  send  out  child  porn,  but  we 
have  an  opt-in  list.  If  people  have  verified 
that  they  want  to  get  offers,  and  they're 
over  18,  we're  not  doing  anything  illegal." 
Whether  the  court  finds  that  true  or  not, 
the  CAN-SPAM  Act  of  2003  seems  to  be 
the  one  anti-spam  approach  Richter  likes. 
"Now  we  don't  have  50  states  passing  new 
laws  every  hour,"  he  says.  "But  the  only 
people  [prosecutors]  can  find  are  people 
in  the  United  States  doing  it  properly,"  he 
says.  "We  can't  catch  a  terrorist  with  a  $25 
million  bounty  on  his  head.  How  are  we 
gonna  catch  a  guy  on  a  dial-up  in  Turkey?" 
Moreover,  he  claims  anti-spammers 
aren't  being  up  front  about  their  own  busi- 
ness interests.  "Why  don't  they  have  a  do-not-mail  list  for  the 
U.S.  Postal  Service?  [Junk  mail]  kills  trees;  it  causes  trash  you 
have  to  pay  for.  The  difference  is  the  government  doesn't  get  a 
piece  of  the  action  for  e-mail.  And  what  would  Brightmail  and 
Postini  do  if  there  was  a  law  tomorrow  that  said  'no  more  spam 
or  you  go  to  jail?'  It's  a  political  game."  The  Radicati  Group 
estimates  that  in  2004,  spammers  will  book  $3.5  billion  in  rev- 
enues, but  the  market  for  anti-spam  software  will  come  close  to 
a billion dollars itself.

Rather than just hawking filterware, Bill Gates has recently
thrown  his  lot  in  with  academics  who  tout  computational  penal- 
ties —  such  as  10  seconds  of  CPU  time  for  each  unsolicited 
message.  Richter  says  that  would  still  let  him  sell  ads  for  pricier 
products.  Moreover,  he  doubts  the  plan  could  be  implemented. 
"How's  he  going  to  get  someone  in  China  to  play  to  his  values 
on  that?  If  it  were  that  easy,  we'd  finally  have  a  world  money 
system.  When  he  can  figure  out  how  to  keep  Windows  from 
crashing,  then  he  can  go  do  more  stuff." 
—  P.B 






subscribe  to  that  had  no  ads  in  it  when 
I  started,"  Winer  says.  "Then  they 
began  having  one  ad  per  day  in  the 
feed.  Now  practically  every  other  mes- 
sage is  an  ad.  I'll  be  unsubscribing 
soon.  One  click  and  they're  gone." 


From: BREWSTER KAHLE
Subject: BOOK 'EM!


Serial  inventor  and  entrepreneur 
Kahle  created  one  of  the  first  Internet 
search  engines,  WAIS  (wide  area 
information  server),  and  then  built  a 
system,  Alexa,  for  tracking  Net  users' 
behavior  en  masse  and  sorting  Web 
sites  automatically  based  on  the  traf- 
fic. Now,  as  head  of  the  Internet 
Archive,  he  has  embarked  upon  a 
quest  to  build  the  modern  online 
equivalent  of  ancient  Egypt's  library  in 
Alexandria. 

Kahle  thinks  that  people  who  abuse 
the  basic  openness  of  the  Net  should 
simply  be  busted.  "Fraud  strikes  me  as 
something  that  we  should  put  people 
into  jail  for,"  he  says.  "If  someone  sends 
you  a  letter  saying,  'Hi,  I'm  Bill  Gates, 
and  I  want  to  sell  you  something,'  how 
would  that  be  greeted?  Right  —  as  a 
crime!  What  are  we  missing  here? 
What  happens  if  we  nail  the  top  100 
spammers? Why haven't we used normal
law enforcement?"

Yet  Kahle  thinks  the  current  focus  on 
anti-spam legislation is misdirected.
"We don't have to reinvent law. We
might already have the pieces together
that  we  need."  He  cites  the  Digital  Mil- 
lennium Copyright  Act  (DMCA)  as  an 
example  of  legislative  overreaction  to 
new  technology. 

Instead  of  passing  sweeping  new  laws 
like  the  DMCA,  Kahle  says,  "We  can 
just  apply  normal  law  to  this  situation. 
Look  at  what  happened  with  packaged 
software  in  the  early  '80s.  This  was  soft- 
ware that  was  valued  at  hundreds  of 
dollars that was being copied for free.
They tried copy protection. They tried
to create all these technical fixes. It didn't
work. Instead, they fell back on the
law.  Now,  people  who  steal  expensive 
software  go  to  jail." 

How  would  that  apply  to  spammers? 
For  those  who  don't  use  their  real 
names and addresses, Kahle says, "You
should be able to go to the FBI and say
'Hey, I'm getting forged documents.'
Will  this  stop  everything?  No,  but  it 
would  discourage  people  from  using 


Ending  E-mail  Forgery 


ALL THE EXPERTS INTERVIEWED FOR THIS ARTICLE AGREE: FIXING E-MAIL HINGES
on positive identification of the sender. And there are practical solutions on
the  horizon  to  drastically  reduce  forgeries  that  characterize  some  of  the  worst 
e-mail  abuses. 

But  first,  some  background.  There  is  a  hole  in  SMTP  big  enough  to  drive  a  truck 
through  and  that's  what  spammers  routinely  do  when  they  forge  sender  addresses. 
Today  it's  trivial  to  send  a  message  that  seems  to  come  from  info@citibank.com  or 
support@microsoft.com.  Whether  it's  a  phisher  groping  for  access  to  a  bank  ac- 
count or  a  cyberterrorist  looking  to  compromise  a  PC,  both  the  company  whose 
identity  is  spoofed  and  the  individual  who  is  attacked  suffer  the  damage. 

Could  cryptographic  proof  of  identity  help?  Companies  that  run  secure  Web  sites 
know  how  to  acquire  the  server  certificates  that  assert  their  identities  and  enable 
SSL  connections.  They  could  also  use  client  certificates  to  sign  e-mail  messages 
and  probably  should.  But  individuals  rarely  have  or  use  digital  certificates,  so  e-mail 
culture  has  never  evolved  a  security  equivalent  of  the  Web's  ubiquitous  SSL  standard. 

In  our  July  18  feature,  "Canning  Spam"  (infoworld.com/246),  we  mentioned  an 
Internet  draft  proposal  from  Hadmut  Danisch,  called  RMX  (Reverse  Mail  Exchange) 
(infoworld.com/246).  The  idea  is  elegantly  simple:  In  addition  to  publishing  the  MX 
(Mail  Exchange)  DNS  records  that  identify  inbound  mail  hosts,  an  organization  also 
publishes  reverse  MX  records  that  identify  outbound  hosts.  A  receiving  server  queries 
the  DNS  to  find  out  if  the  sending  host  is  so  authorized.  The  name  yahoo.com  is  easy 
to  forge,  but  the  IP  addresses  of  Yahoo's  outbound  servers  are  not. 
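
The reverse-MX check described above can be sketched in a few lines. To stay self-contained, the sketch replaces real DNS lookups with a dictionary; the `AUTHORIZED_OUTBOUND` table, the function names, and the example addresses are all hypothetical and do not reflect the RMX draft's actual record format.

```python
import ipaddress

# Hypothetical stand-in for DNS: each domain lists the networks that are
# allowed to send mail on its behalf.
AUTHORIZED_OUTBOUND = {
    "example.com": ["192.0.2.0/28", "198.51.100.25/32"],
}


def envelope_domain(mail_from: str) -> str:
    """Extract the domain from an envelope sender address."""
    return mail_from.rsplit("@", 1)[-1].lower()


def is_authorized_sender(mail_from: str, client_ip: str,
                         records=AUTHORIZED_OUTBOUND) -> bool:
    """True if the connecting IP is listed for the envelope sender's domain."""
    networks = records.get(envelope_domain(mail_from), [])
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(net) for net in networks)
```

For simplicity this sketch treats a domain with no published record as unauthorized. A real deployment would more likely treat "no record" as unknown rather than as a failure, since most domains would not publish records at first.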

The  devil's  always  in  the  details,  of  course.  It's  remarkably  difficult  to  define  exactly 
what "sender" means in today's complex e-mail environment. Three current proposals,
pobox.com's SPF (Sender Policy Framework), Microsoft's Caller ID for
E-Mail (infoworld.com/1195), and Yahoo's DomainKeys, take different approaches.
SPF,  like  RMX,  focuses  on  the  "envelope  sender."  That's  the  MAIL  FROM  address 
asserted  by  the  sending  host  during  setup  of  an  SMTP  connection,  not  the  From: 
header  contained  in  the  body  of  the  message.  In  various  legitimate  cases  including 
mailing  lists,  mobile  messaging  services,  and  forwarders,  the  domains  of  the  two 
fake addresses. We just haven't made
it  a  priority  to  crack  down  on  them." 

Kahle's different from most spam-bashers
in that he thinks online advertising
is just fine. "It's always going to
be  a  mini-industry  to  advertise  to  peo- 
ple on  the  Net,"  he  says.  "And  I  don't 
think  we  should  make  everything 
completely  pristine,  because  a  lot  of 
good  ideas  come  from  the  shadows. 
We  just  want  to  know  when  we're 
dealing with the shadowy areas of the
Net and when we're not."

To  that  end,  one  of  Kahle's  proposals 
would  require  e-mail  senders  to  list  the 
jurisdiction  under  which  their  mes- 
sages are  sent.  "If  you  get  something 
from  the  .uk  domain,  you're  pretty 
clear  on  the  rules  its  sender  operates 
under  in  England.  But  if  the  mail  is 
from  .to,  you  might  not  know,  and  you 
could  be  a  lot  more  suspect  about  it.  It's 
the  same  reason  ships  fly  flags  of  dif- 
ferent countries." 


Internet  entrepreneurs  tend  to  be 
leery  of  government  involvement. 
Kahle,  by  contrast,  is  all  for  it,  citing 
Ben  Franklin's  30-year  role  in  shaping 
colonial America's postal systems.
"E-mail  is  as  important  now  as  the 
postal  system  was  in  the  Revolutionary 
days," Kahle says. "Why aren't we taking
it that seriously now?"

Paul Boutin is a Silicon Valley writer who
spent 15 years as a software engineer and
manager.


addresses  can  differ.  A  companion  proposal  called  SRS  (Sender 
Rewriting  Scheme)  describes  how  to  modify  headers  in  transit  so 
that  intermediaries  aren't  seen  as  forgers  by  SPF-aware  hosts. 

It's  remarkably  difficult  to  define 
exactly  what  'sender'  means  in  today's 
complex  e-mail  environment. 

Though  significantly  complicated  by  SRS,  the  SPF  approach  is 
arguably  a  good  first  line  of  defense.  It's  admirably  lightweight.  If  a 
message  fails  the  envelope  sender  test,  it  can  be  rejected  without 
receipt  or  inspection  of  its  contents.  But  messages  that  pass  this  test 
can  still  easily  forge  the  From:  header  in  the  message  itself,  and 
that's  the  address  that  mail  clients  display  to  users.  So  Caller  ID 
digs  into  the  message  itself  in  search  of  a  purported  responsible 
address  whose  domain  will  be  queried  for  authorization.  When  multiple 
identities  are  involved,  as  with  forwarding,  the  Caller  ID  spec 
recommends  that  mail  clients  display  the  addresses  of  intermediaries 
as  well  as  originators.  But  as  with  SPF,  the  authorization  check 
applies  only  to  the  party  "most  immediately  responsible  for  the 
transmission  of  a  message,"  not  to  the  originator. 

DomainKeys  focuses  squarely  on  the  From:  header  that  purports 
to  identify  the  author  of  the  message.  The  domain  owner 
responsible  for  the  address  in  that  header  stores  a  public  key  in  a 
DNS  record  and  issues  the  corresponding  private  key  to  its  outbound 
mail  servers  so  they  can  sign  outgoing  messages.  A 
receiving  host  looks  up  the  public  key  in  the  DNS  and  uses  it  to 
verify  signatures  on  inbound  messages. 

Of  the  three  schemes,  Sendmail  CTO  Eric  Allman  favors 
DomainKeys  because  it's  the  only  one  that  doesn't  "break  e-mail's 
fundamental  store-and-forward  model."  A  signed  message  that 
travels  through  a  forwarder  can  still  be  verified  by  the  receiver  with 
respect  to  the  originating  sender's  domain.  Since  intermediaries 
sometimes  rearrange  headers,  however,  the  DomainKeys  architects 
are  currently  having  a  lively  debate  about  which  headers  to 
sign  along  with  the  message  body. 

One  of  those  architects  is  the  prolific  open  source  developer 
Russell  Nelson.  He  sees  DomainKeys  as  a  near-term  fix  with 
long-term  growth  potential.  If  corporate  keys  can  be  woven 
into  the  DNS  fabric,  it  might  be  possible,  using  a  DomainKeys 
feature  called  selectors,  to  extend  support  for  individual 
keys  as  well. 
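
The sign-and-verify flow that DomainKeys relies on can be sketched in a few lines. This is a toy under loud assumptions, not the DomainKeys wire format: the "DNS" is an in-memory dictionary, the key is a deliberately small RSA key built from two well-known Mersenne primes, and the names `FAKE_DNS`, `sign`, and `verify` are hypothetical. A real deployment would use a vetted cryptography library and full-size keys.

```python
import hashlib

# Toy RSA key built from two Mersenne primes. Far too small for real use;
# it only illustrates the publish-sign-verify flow.
P, Q = 2**61 - 1, 2**31 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)

# Hypothetical stand-in for DNS: a name built from selector and domain
# maps to the domain's public key.
FAKE_DNS = {"mail1._domainkey.example.com": (N, E)}


def digest(from_header, body):
    """Hash the From: header plus body down to an integer smaller than N."""
    data = (from_header + "\r\n" + body).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(from_header, body):
    """Outbound mail server: sign with the domain's private key."""
    return pow(digest(from_header, body), D, N)


def verify(selector, domain, from_header, body, signature):
    """Receiving host: look up the public key by name, then check the signature."""
    key = FAKE_DNS.get(f"{selector}._domainkey.{domain}")
    if key is None:
        return False  # no published key, so nothing can be verified
    n, e = key
    return pow(signature, e, n) == digest(from_header, body)
```

Because the signature covers the From: header and the body rather than the delivery path, a forwarder can relay the message untouched and the receiver can still verify it, which is the property the sidebar highlights.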

But  first  things  first.  We'll  soon  find  out  whether  any  of  these 
schemes  —  or  some  combination  of  them  —  will  prove  workable. 
If  you  run  an  e-mail  system,  you  can  best  prepare  for  the  anti- 
forgery  experiments  about  to  unfold  by  ensuring  that  your  home 
and  mobile  users  can  securely  contact  the  hosts  you'd  like  to 
authorize.  Now's  the  time  to  stop  procrastinating  about  the  virtual 
private  network,  and/or  enable  secure  connections  to  your  outbound 
mail  hosts. 
—Jon  Udell 
Al-Hussayen  pages  on  Internet  Archive 

Betsy  Z.  Russell  -  Staff  writer 

BOISE  -  An  Internet  archive  service  has  placed  all  but  two  of  the  300-plus  past  web  pages  Sami  Omar 
Al-Hussayen  is  accused  of  using  to  promote  terrorism  on  its  free  archive  service,  available  to  anyone. 

"We  neither  select  content  nor  block  things  --  we  just  get  things  that  are  on  the  net  and  put  it  up,"  Molly  Beth 
Davis,  administrator  of  the  San  Francisco-based  Internet  Archive,  told  the  court  Thursday. 

The  archive  service's  mission  is  "universal  access  to  all  human  knowledge,"  she  said. 

David  Nevin,  Al-Hussayen's  lead  defense  attorney,  asked  her,  "And  just  so  we're  clear,  you  haven't  been  arrested 
or  anything?" 

Prosecutors  objected,  and  Nevin  had  to  withdraw  the  question.  But  after  the  trial  wrapped  up  for  the  afternoon,  as 
he  packed  up  his  stacks  of  documents,  Nevin  said,  "Is  it  a  crime  to  make  this  stuff  publicly  available  or  not?  If  it 
is,  let's  round  up  everybody  that's  doing  it,  stick  'em  in  jail." 

Davis  actually  was  a  witness  for  the  prosecution,  which  wants  to  use  more  than  300  printouts  of  past  Web  pages 
from  www.archive.org,  the  Internet  Archive's  site,  as  evidence  against  Al-Hussayen. 

The  defense  has  objected,  saying  the  printouts  aren't  a  reliable  enough  source  to  serve  as  evidence  in  court.  Web 
pages  change  frequently,  and  archive  programs  can  make  mistakes,  they  argued. 

Todd  Hinnen,  an  attorney  with  the  computer  crimes  and  intellectual  property  section  of  the  U.S.  Department  of 
Justice  in  Washington,  D.C.,  argued  that  the  printouts  are  reliable,  in  part  because  Internet  Archive's  contract 
with  a  related  company  that  gathers  the  pages  from  the  web  requires  accuracy. 

"It  is  part  of  the  business  relationship  between  the  two  organizations  that  that  machine  collects  only  true  and 
accurate  records,"  Hinnen  told  the  court. 

Davis  was  brought  in  to  describe  her  service  so  the  court  could  decide  whether  to  allow  the  evidence,  an  issue 
that's  still  pending.  With  her  blue-tinted  bangs,  tiny  dark-rimmed  glasses  and  expressive  mannerisms,  she  struck  a 
different  note  in  a  courtroom  that  had  been  dominated  for  most  of  the  day  by  painstaking  testimony  about  the 
purpose  of  various  lines  on  immigration  forms. 

Internet  Archive  is  a  nonprofit  library  service  that  has  catalogued  over  33  billion  Web  pages  going  back  to  1996, 
Davis  told  the  court.  Its  contractor  operates  a  Web  crawler  that  captures  snapshots  of  publicly  available, 
non-password  Web  pages  about  every  two  months,  then  stores  them  away. 

"We  donated  a  copy  to  the  library  at  Alexandria  in  Egypt,"  Davis  said,  to  ensure  the  archive  never  gets  lost. 

A  program  called  the  "wayback  machine"  allows  users  to  enter  a  time  and  date  for  a  particular  Web  site,  then  see 
the  way  it  looked  around  that  time. 

Davis  said  the  archive  service  is  like  a  library,  and  it  was  started  because  "we  want  to  save,  basically,  our  digital 
heritage,  average  life  span  of  a  Web  page  is  like  80  days.  We  want  future  historians  to  go  back  and  be  able  to  see 
what  the  beginning  of  the  Internet  was  all  about." 

Nevin  maintains  that  extremist  messages  federal  authorities  found  on  some  Web  sites  linked  to  Al-Hussayen 
weren't  Al-Hussayen's  views,  and  in  some  cases  were  placed  on  the  sites  by  others. 

The  government  accuses  Al-Hussayen  of  intentionally  helping  terrorists  with  a  recruiting  and  fundraising 
campaign  over  the  Internet. 

Before  the  court  decides  whether  to  allow  the  Internet  Archive  printouts  as  evidence,  it  may  question  the  Web 
crawler  company  on  the  reliability  of  its  information-gathering.  Davis  said  she  could  guarantee  that  her  archive 
service  provides  true  and  accurate  copies  of  what  it  has  stored,  but  that's  all. 

"You  don't  know  what  Palestine-info.com  looked  like  on  a  particular  day,  right?"  Nevin  asked  her. 

"Of  course  not,"  she  said. 

•For  more  on  the  Al-Hussayen  trial,  see  our  Web  log  at  www.spokesmanreview.com/boise. 
Why  Writing  Your  Own  Search  Engine  Is  Hard 


ANNA  PATTERSON,  STANFORD  UNIVERSITY 


There  must  be  4,000  programmers  typing  away  in  their 
basements  trying  to  build  the  next  "world's  most  scalable" 
search  engine.  It  has  been  done  only  a  few  times.  It 
has  never  been  done  by  a  big  group;  always  one  to  four 
people  did  the  core  work,  and  the  big  team  came  on  to 
build  the  elaborations  and  the  production  infrastructure. 
Why  is  it  so  hard?  We  are  going  to  delve  a  bit  into  the 
various  issues  to  consider  when  writing  a  search  engine. 
This  article  is  aimed  at  those  individuals  or  small  groups 
that  are  considering  this  endeavor  for  their  Web  site  or 
intranet.  It  is  fun,  but  a  word  of  caution:  not  only  is  it 
difficult,  but  you  need  two  commodities  in  short  supply: 
time  and  patience. 

SUPER-SHORT  SEARCH  ENGINE  OVERVIEW 
OK,  let's  do  it.  Let's  write  a  search  engine. 

A  crawler  gets  the  Web  pages  off  of  that  pesky  Web 
and  onto  your  beautiful  disks.  You'll  need  lots  of  disks. 

Then  you  need  to  index  these  pages,  that  is,  say  which  page 
has  which  words.  This  will  tell  you  that  Janet  Jackson 
was  found  on  the  www.superbowl.com  page.  Usually, 
indexing  happens  locally  on  the  disks  where  your  crawler 
dumped  these  Web  pages.  Hey,  why  move  them? 

In  most  architectures,  now  you  need  to  merge  these 
indices  so  that  you  have  one  place  to  go  to  in  order  to 
find  all  the  pages  mentioning  Janet  Jackson's  Super  Bowl 
performance.  When  you  merge  all  these  small  indices, 
the  final  index  will  be  so  big  that  it  won't  fit  on  one 
machine.  This  means  that  you'll  have  to  merge  these 
small  indices  in  such  a  way  as  to  split  the  final  big  index 
across  many  machines. 
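
The index-then-merge-then-split step can be sketched as follows. This is a minimal illustration under stated assumptions: pages are plain strings keyed by integer ids, words are split on whitespace, and the final index is split by hashing each word (the function names and the CRC32 choice are mine, not the article's).

```python
import zlib
from collections import defaultdict


def build_index(pages):
    """Small local index: map each word to the sorted page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


def shard_for(word, num_machines):
    """Pick the machine that owns a word. CRC32 is stable across runs,
    unlike Python's built-in hash() for strings."""
    return zlib.crc32(word.lower().encode("utf-8")) % num_machines


def merge_and_split(small_indices, num_machines):
    """Merge the small indices so each word's full list lives on one machine."""
    shards = [defaultdict(set) for _ in range(num_machines)]
    for index in small_indices:
        for word, ids in index.items():
            shards[shard_for(word, num_machines)][word].update(ids)
    return [{w: sorted(ids) for w, ids in shard.items()} for shard in shards]
```

With this layout, a query for one word goes to exactly one machine, `shards[shard_for(word, n)]`, which is the "one place to go" the merge is meant to provide.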

Now  you  are  ready  to  serve  queries?  Wrong.  Now  you 
build  the  runtime  system  that  gets  users'  queries,  retrieves 
the  results  out  of  the  index  from  the  right  machine(s), 
and  re-ranks  them  according  to  the  query.  All  this,  while 
people  are  drumming  their  fingers  on  their  desks,  waiting: 
hopefully  lots  of  people  and,  hopefully,  not  enough 
time  for  much  drumming. 

People  talk  a  lot  about  the  thousands  of  machines  needed 
to  build  a  search  engine.  This  sounds  very  scary.  All 
search  engines,  however,  started  with  a  lot  more  thought 
and  design  than  they  did  machines.  So  let's  see  what  is 
fact  and  what  is  fallacy. 

Bandwidth.  Legend  has  it  that  venture  capitalists  used 
to  buy  hard  disks  for  young  entrepreneurs  to  prove  that 
their  ideas  would  work.  Now  disks  are  cheap — but  the 
new  bottleneck  is  bandwidth.  Usually  that  takes  capital. 
You  need  this  bandwidth  to  get  the  pages  from  the  Web 
in  the  first  place.  The  "CPU-ness"  or  memory  of  the 
machines  that  you  use  doesn't  really  matter.  All  that  matters 
is  how  much  bandwidth  you  have  (can  afford)  and 
can  use,  because  crawling  is  not  a  CPU  endeavor;  crawling 
is  a  bandwidth  monster. 

There  are  lots  of  ways  around  this  issue,  but  the  most 
useful  is  to  realize  that  you  won't  get  the  indexer  and  the 
servers  working  right  (if  at  all)  for  six  months,  anyway,  so 
crawl  slowly  and  index  what  you  have  as  you  go  along. 
Bugs  will  show  up  in  the  later  phases,  so  the  lack  of  pages 
won't  be  the  thing  holding  you  up;  instead,  it'll  be  those 
nasty  bugs  slowing  you  down.  So  crawl  continuously  at 
whatever  rate  you  can  afford  (down  to  1-megabyte  DSL), 
and  the  rest  will  take  care  of  itself.  By  the  time  you  have  a 
search  engine  that  works  on  the  pages  you  have  and  can 
keep  up  with  your  super-slow  crawl,  perhaps  you'll  be  in  a 
position  to  afford  big  bandwidth  by  raising  capital. 

Big  bandwidth  is  usually  found  at  a  collocation  facility 
(or  colo).  I  want  to  warn  against  this  if  you  are  a  super-small 
company.  Get  the  bandwidth  to  the  office!  If  you 
have  a  small  team,  the  last  thing  you  can  afford  is  people 
on  the  highway  all  day  long  running  to  the  colo.  This  is 
another  big  reason  that  I  recommend  small  bandwidth 
for  the  development  phase.  You  can't  afford  the  loss  of 
a  person  for  half  a  day  to  go  exchange  a  disk.  Another 
reason  to  avoid  a  colo  is  that  it's  hugely  expensive.  Just 
throw  the  stack  of  machines  under  your  desk  and  consider 
it  a  space  heater. 

CPU  Issues.  People  argue  all  day  about  which  types  of 
CPUs  to  use  for  which  phase  of  a  search  engine.  Most 
people  argue  that  the  ideal  is  to  get  stupid  CPUs  for  crawling 
and  fast  CPUs  for  indexing  and  serving.  Why  is  this? 

You  don't  need  a  lot  of  thinking  to  do  crawling;  you 
need  bandwidth,  so  any  old  CPU  will  do.  For  indexing, 
you  are  doing  a  lot  of  I/O  and  a  lot  of  thinking/analyzing 
the  page,  so  the  bigger  the  better.  At  serve  time,  you're 
going  to  need  to  re-rank  the  URLs  in  response  to  a  query, 
so  again,  the  bigger  the  better. 

Since  you're  writing  the  search  engine  yourself, 
however,  it  has  to  be  one  size  fits  all.  Most  indexing 
algorithms  worth  their  salt  will  probably  peg  any  CPU.  So 
the  same  advice  goes:  it  doesn't  matter,  get  what  you  can 
afford;  the  bugs  you  write  will  slow  you  down  more  than 
the  cheap  CPUs.  If  you  have  to  look  around  your  local 
Fry's  or  CompUSA  for  CPUs,  however,  more  on-board 
cache  will  be  key  for  the  indexing  algorithms  because 
more  of  the  page  will  be  kept  onboard. 

If  your  algorithm  doesn't  peg  a  Pentium  4,  then 
rethink  the  game  plan  of  building  a  better  search  engine, 
because  yours  will  not  be  the  one  that  wins. 

Disk  Issues.  SCSI  is  faster,  but  IDE  is  bigger  (and  cheaper). 
If  you  are  writing  a  search  engine  yourself,  use  IDE.  This 
will  save  money  in  many  ways.  You  get  bigger  disks,  so 
one  machine  can  hold  1  terabyte  for  IDE  disks  easily,  but 
this  just  isn't  the  case  for  SCSI.  Secondly,  SCSI  disks  are  a 
lot  more  expensive — also  not  a  good  idea  for  four  guys  in 
the  garage. 

At  runtime,  you'll  be  disk-bound.  You  have  two  tasks: 
get  the  index  entries  off  disk  and  re-rank  these  for  relevancy. 
For  getting  the  index  entries  off  disk,  you  might 
think  the  faster  the  disk  the  better.  But  users  will  not  see 
the  performance  increase  you  get  from  SCSI  in  the  disk 
transfer  rate,  because  it  takes  a  lot  of  practice  with  the 
search  engine  end  game  (the  runtime  architecture)  for 
this  difference  to  be  an  issue.  Instead,  use  parallelism  and 
multiple  cheap  disks  to  achieve  this  speed-up.  This  will 
still  save  you  money  in  buying  fewer  machines  and  give 
you  practice  with  the  key  tool  of  search  engine  architectures: 
parallelism. 

Ah,  but  SCSIs  are  hot-swappable,  you  say.  Get  over  it. 
Remember,  no  colo.  You  cannot  afford  it  and  you  don't 
want  it.  So  if  you're  worried  about  disk  failures  since  you 
picked  your  disks  out  of  a  Dumpster,  then  my  advice  is 
don't  screw  the  covers  onto  your  machines  and  don't  use 
four  screws  per  disk.  This  makes  IDEs  pretty  easy  to  repair, 
but  certainly  not  hot-swappable. 

Storing  Files.  Old-fashioned  file  systems  used  to  have  a 
limit  on  file  size — some  of  them  had  a  2-gigabyte  limit. 
These  file  systems  also  used  to  have  an  issue  with  storing 
lots  and  lots  of  files  in  one  directory.  For  these  reasons, 
the  prevailing  wisdom  has  been  to  crawl  a  bunch  of  URLs 
and  stuff  them  into  one  big  file  (up  to  the  limit)  and  then 
start  on  the  next  file.  Even  though  current  operating  systems 
don't  have  the  same  number-of-file  restrictions  they 
used  to,  putting  lots  of  pages  in  one  file  is  still  a  good 
idea.  Stuff  them  in — up  to  the  limit  of  good  performance 
of  your  operating  system. 

Why?  When  indexing,  or  laying  down  the  crawl,  a  big 
continuous  file  saves  a  whole  lot  of  disk  seeks — the  fewer 
files  the  better.  Disk  seeks  will  kill  you  even  if  your  disk 
transfer  rate  is  high.  You  cannot  afford  the  time  to  seek 
to  a  file  to  process  a  Web  page.  Web  pages  right  now  average 
around  10  kilobytes  per  page  (I'm  such  an  oldtimer, 
I  remember  when  they  were  2  KB,  and  others  remember 
when  they  were  1  KB).  You  don't  want  to  seek  to  a  disk 
to  read  10  KB  when  we  are  talking  about  millions,  if  not 
billions,  of  Web  pages.  Essentially,  this  will  almost  double 
your  processing  time,  as  well  as  fry  your  disks  from  the 
Dumpster. 
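
One way to stuff many pages into one big file is a length-prefixed record format, sketched below. The layout (a 2-byte URL length and a 4-byte body length, followed by the URL and the page bytes) is an assumption made for illustration, not a format the article prescribes.

```python
import struct

# Record header: URL length (unsigned 16-bit), body length (unsigned 32-bit).
HEADER = struct.Struct(">HI")


def append_page(f, url, body):
    """Append one crawled page to the end of an open binary file."""
    url_bytes = url.encode("utf-8")
    f.write(HEADER.pack(len(url_bytes), len(body)))
    f.write(url_bytes)
    f.write(body)


def read_pages(f):
    """Yield (url, body) pairs by reading the file front to back, with no seeks."""
    while True:
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # clean end of file (or a truncated final record)
        url_len, body_len = HEADER.unpack(header)
        url = f.read(url_len).decode("utf-8")
        body = f.read(body_len)
        yield url, body
```

The indexer can then stream through `read_pages(open(path, "rb"))` sequentially, so the disk spends its time transferring data rather than seeking once per 10-KB page.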

While  you  might  think  that  it  is  conceptually  cleaner 
to  store  one  Web  page  per  one  file,  this  will  become  a 
management  pain — and  it  will  also  slow  down  your 
processing. 

Networking.  With  real  estate  they  say  "location,  location, 
location."  Well,  a  good  search  engine  rule  that  I've 
learned  the  hard  way  is:  Don't  use  NFS.  Don't  use  NFS. 
Don't  use  NFS  (network  file  system).  NFS  might  seem  like 
a  great  idea  for  an  index  that  won't  fit  on  one  machine 
(and  yours  probably  won't).  It  seems  like  the  perfect  solution. 
If  you  put  the  index  on  multiple  machines,  then 
NFS  will  make  it  seem  like  your  index  is  on  one  machine. 
Sound  good?  That  way  you  don't  have  to  do  or  learn  any 
networking  yourself.  Wrong!  You'll  have  to  do  real  distributed 
systems  work  for  the  serving  architecture,  anyway,  so 
get  it  over  with  and  do  the  work  now. 


Current  NFS  implementations  can't  stand  the  punishment 
inflicted  by  the  runtime  system  or  the  indexing 
phase  without  using  "spendy"  specialized  hardware. 

In  the  indexing  phase,  you  will  get  corrupted  indices 
as  you  try  to  do  lots  of  networked  writes.  Ask  the  contributors 
to  NFS  in  Linux  and  they  will  tell  you  the  same: 
not  ready  for  serious  punishment. 

Next,  using  NFS  in  the  runtime  system,  you  will  get 
machines  that  don't  have  fault  tolerance.  If  one  of  the 
NFS'd  machines  is  sick,  then  the  rest  just  seize.  Not  good. 

SOFTWARE  TO  WRITE/GET 

Crawler.  If  you  don't  use  an  open  source  crawler,  my 
advice  is  a  super-simple  multistep  crawler.  This  is  very 
important  advice  that  will  cut  months  off  your  development 
time,  so  if  you  ignore  everything  else,  don't  ignore 
this. 




If  you  want  to  build  a  crawler  yourself,  then  first  get 
a  list  of  URLs  that  you  want  to  seed  your  crawler  with 
(these  need  to  be  good  starting  points  for  exploring  the 
Web — dmoz,  Yahoo...).  Then  write  any  simple  program 
that  will  get  them.  For  instance,  (dolist  (y  list-of-URLs)  (GET 
y))  is  essentially  all  you  need. 

When  you  get  these  pages,  analyze  the  outgoing  links 
in  the  pages  to  create  a  new  list  for  your  simple  crawler 
and  go  get  those.  What  about  duplicates,  you  ask?  Running  sort  | 
uniq  on  Linux  will  do  this  for  you;  otherwise,  I  think  you 
can  handle  it.  This  takes  care  of  duplicate  URLs,  but  what 
about  duplicate  content?  My  advice:  find  those  at  serve 
time. 
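
The multistep crawler described above can be sketched as one function that is called once per step. This is a hedged sketch: `fetch` is a caller-supplied function (hypothetical here) that returns a page's HTML or `None`, link extraction only looks at `<a href>` tags, and the in-memory set stands in for `sort | uniq`.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def outgoing_links(base_url, html):
    """Return absolute http(s) links found in a page, with #fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    absolute = (urldefrag(urljoin(base_url, href)).url for href in parser.links)
    return [u for u in absolute if u.startswith(("http://", "https://"))]


def crawl_step(urls, fetch, seen):
    """Fetch one list of URLs; return (pages, next list of URLs not yet seen)."""
    pages, found = {}, set()
    for url in urls:
        html = fetch(url)
        if html is None:
            continue  # fetch failed; skip this URL
        pages[url] = html
        found.update(outgoing_links(url, html))
    seen.update(urls)
    return pages, sorted(found - seen)  # de-duplicated list for the next step
```

Each call handles one step: write `pages` to disk, feed the returned list into the next call, and repeat. Duplicate content is deliberately left alone, as the text advises.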

The  really  hard  problem  with  crawlers  is  to  perform 
dynamic  duplicate  elimination,  that  is,  eliminating  both  duplicate 
URLs  and  duplicate  content.  With  the  system  that  I 
described,  we've  avoided  getting  a  Ph.D.  dissertation  and 
instead  have  some  piece  of  code  you  can  hand  off  to  your 
youngest  sibling. 

Indexing.  Next  you  need  to  churn  through  the  pages 
and  build  an  index.  This  is  tricky.  Just  don't  do  anything 
wrong,  as  the  saying  goes.  One  false  step  and  those  billions 
of  pages  are  going  to  take  too  long  to  process  and 
your  1-MB  DSL  crawling  line  is  going  to  seem  fast. 

There  is  a  major  field  of  study  about  the  different 
things  to  index  on.  Don't  get  a  Ph.D.;  just  index  on 
words.  Words  are  what  people  search  for;  they  don't 
search  for  N-Grams  or  letters  or  PTrees  or  locations  in 
streams,  so  any  method  other  than  the  simplest  will 
make  you  seem  clever.  But,  hey,  writing  your  own  search 
engine  is  hard  enough.  Save  what  cleverness  you  own  for 
ranking. 

Two  other  pieces  of  key  advice:  First,  just  index  the 
data  you  need  to  serve  your  kind  of  search  results  and 
do  your  kind  of  ranking.  Don't  write  down  everything 
and  the  kitchen  sink — save  that  for  when  you  go  ultra- 
commercial.  The  first  item  of  business 
is  getting  something  presentable  up. 
Correction — start  by  getting  something 
up.  Find  out  what  went  wrong  and  fix  it. 

Second,  do  not  get  attached  to  the 
"index  format."  The  hallowed  "index 
format"  is  not  the  end  of  the  search 
engine,  it  is  just  the  beginning.  It  is 
a  tool  to  see  results,  so  change  it  and 
change  it  often.  Play  with  it,  and  you 
and  your  team  will  be  on  a  winner  to  be 
able  to  improve  search  results  quickly. 

Why  would  you  need  to  add  things 
to  the  index?  Perhaps  you've  just 
decided  that  it  would  be  a  good  idea  to 
keep  whether  the  indexed  word  is  in 
the  title.  So  now  you  need  a  space  to 
annotate  this  fact.  You  might  have  other 
ideas  that  mean  adding  more  data  to 
the  index. 




Let's  say  that  you've  worked  in  the  long  dark  until  the 
proud  day  when  you  type  in  a  search  for  bug,  and  pages 
that  mention  Britney  Spears  but  not  bug  appear.  All  kinds 
of  things  like  that  happen.  Do  a  dance;  you're  almost 
there.  Just  keep  fixing. 

A  last  word  of  advice:  when  in  the  development  phase, 
keep  a  disk-based  index  architecture.  You  are  not  getting 
lots  of  traffic,  you  want  flexibility  regarding  which  items 
to  place  in  the  index,  and  mostly  you  want  a  happy  team. 
A  happy  team  does  not  fight  over  bits.  A  happy  team  does 
not  see  whose  new  feature  is  in  and  whose  is  out  because 
there  isn't  enough  memory.  Buy  disks,  play  with  features, 
and  have  fun. 

Dynamic  versus  Static  Ranking.  Don't  do  page  rank  initially. 
Actually,  don't  do  it  at  all.  For  this  observation  I  risk 
being  inundated  with  hate  mail,  but  nonetheless  don't 
do  page  rank.  If  you  four  guys  in  your  garage  can't  get 
something  decent-looking  up  without  page  rank,  you're 
not  going  to  get  anything  decent  up  with  page  rank.  Use 
the  source,  Luke — the  HTML  source,  that  is. 

Page  rank  is  lengthy  analysis  of  a  global  nature  and  will 
cause  you  to  buy  more  machines  and  get  bogged  down  on 
this  one  complicated  step,  this  one  factor 
in  ranking.  Start  by  exploiting  everything 
else  you  can  think  of:  Is  the  word 
in  the  title?  Is  it  in  bold?  etc.  Spend  your 
time  thinking  about  anything  you  can 
exploit  and  try  it  out. 
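
A sketch of "use the HTML source" as a ranking signal: bucket each word by where it appears in the page and weight the buckets. The weights below are made-up numbers for illustration, and the parser only distinguishes title, bold, and everything else.

```python
from html.parser import HTMLParser

# Made-up weights for illustration; tune them against real queries.
WEIGHTS = {"title": 5.0, "bold": 2.0, "body": 1.0}


class FeatureParser(HTMLParser):
    """Bucket each word by where it appears: title, bold text, or plain body."""

    def __init__(self):
        super().__init__()
        self.depth = {"title": 0, "bold": 0}
        self.words = {"title": [], "bold": [], "body": []}

    @staticmethod
    def _kind(tag):
        if tag == "title":
            return "title"
        if tag in ("b", "strong"):
            return "bold"
        return None

    def handle_starttag(self, tag, attrs):
        kind = self._kind(tag)
        if kind:
            self.depth[kind] += 1

    def handle_endtag(self, tag):
        kind = self._kind(tag)
        if kind and self.depth[kind] > 0:
            self.depth[kind] -= 1

    def handle_data(self, data):
        if self.depth["title"]:
            where = "title"
        elif self.depth["bold"]:
            where = "bold"
        else:
            where = "body"
        self.words[where].extend(data.lower().split())


def score(html, word):
    """Weighted count of a word's occurrences in one page."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    word = word.lower()
    return sum(WEIGHTS[where] * words.count(word)
               for where, words in parser.words.items())
```

Adding another signal (headings, anchor text, URL words) means adding one bucket and one weight, which is the kind of cheap experiment the text recommends over a global link analysis.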

This  again  will  give  you  the  freedom 
and  make  you  develop  an  architecture 
good  for  adding  things  and  trying  them 
out.  This  will  become  invaluable  later. 

Serving.  Runtime  systems  are  hard. 
Algorithms  are  hard.  The  hardest  part 
about  a  search  engine  is  that  you  have 
to  do  both.  They  have  to  work  together, 
and  both  parts  are  absolutely  critical. 
At  serve  time,  you  have  to  get  the 
results  out  of  the  index,  sort  them  as  per 
their  relevancy  to  the  query,  and  stick 
them  in  a  pretty  Web  page  and  return 
them. 
If  it  sounds  easy,  then  you  haven't  written  a  search 
engine.  Remember,  first,  that  some  queries  have  more 
than  one  word.  This  means  that  you  have  to  intersect 
the  index  entries  for  the  two  words.  My  advice  is  to  have 
them  presorted  in  some  canonical  URL  number  order  so 
that  you  can  view  the  two  (or  n)  index  entries  as  two  stacks 
and  pop  until  the  tops  are  equal,  in  which  case,  you  win 
the  prize:  the  URL  is  in  both  index  entries.  These  sorts  of 
computations  have  to  be  run  at  query  time  and  they  need 
to  be  run  quickly,  so  think  hard  about  how  you  are  going 
to  do  intersections. 
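
The stack-popping intersection can be written as a two-pointer walk over lists that are already sorted by URL number. A minimal sketch, assuming each index entry is a Python list of ascending integers:

```python
def intersect(a, b):
    """Intersect two index entries sorted by ascending URL number."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # the URL is in both entries
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # "pop" the smaller top
        else:
            j += 1
    return out


def intersect_all(entries):
    """Intersect n entries, shortest first so the working list shrinks quickly."""
    if not entries:
        return []
    ordered = sorted(entries, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect(result, other)
        if not result:
            break  # nothing left to match
    return result
```

Each pairwise pass is linear in the combined length of the two lists, which is why the presorted canonical order matters: without it, every query would need a sort or a hash build first.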

Next  problem,  query  time  ranking.  Now  that  you  have 
the  list  of  URLs,  you  have  to  rank  them  according  to 
your  relevancy  algorithm.  This  has  to  be  fast.  People  are 
waiting. 

The  fastest  thing  to  do  at  runtime  is  pre-rank  and 
then  sort  according  to  the  pre-rank  part  of  your  indexing 
structure.  This  often  results  in  generic  (read:  not  the 
best  of  breed)  ranking  algorithms.  You  need  to  take  into 
account  the  actual  query  when  you  are  ranking.  Thus, 
you  need  some  data  in  your  index  to  help  take  the  query 
into  account  and  re-rank  your  a  priori  ranking  quickly  at 
runtime. 
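
A sketch of re-ranking an a priori ranking at query time. The assumptions are mine: each candidate carries a precomputed score plus the set of words in its title (the extra index data), and a query word found in the title adds a fixed boost.

```python
import heapq


def rerank(candidates, query_words, top_k=10, title_boost=2.0):
    """Return the top_k URL ids after adjusting pre-ranks for this query.

    candidates: iterable of (url_id, prerank, title_words) tuples, where
    title_words is a set of lowercase words stored in the index.
    """
    query = [w.lower() for w in query_words]

    def score(candidate):
        _url_id, prerank, title_words = candidate
        hits = sum(1 for w in query if w in title_words)
        return prerank + title_boost * hits

    # nlargest avoids sorting the whole candidate list when top_k is small.
    best = heapq.nlargest(top_k, candidates, key=score)
    return [url_id for url_id, _prerank, _title in best]
```

The pre-rank does most of the work offline; the query-dependent part touches only data already sitting in the index entry, so it stays cheap while people are waiting.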

For  the  basic  runtime  architecture,  you  will  find  no 
end  to  people  willing  to  argue  about  the  "appropriate" 
way  to  do  it.  In  practice,  there  are  two  basic  disk-based 
methods  and  other  memory-based  methods.  Since  we're 
doing  this  on  the  cheap,  we'll  cover  just  the  basic  disk- 
based  methods. 

The  first  major  method  is  this:  after  indexing  the  files 
locally — where  your  crawler  deposited  them — leave  the 
little  indices  there.  Yes,  do  nothing  more.  This  means  at 
runtime  you  ask  all  machines  that  have  answers  for  the 
appropriate  query  to  get  back  to  you  ASAP.  You  drum 
your  fingers  as  long  as  you  are  willing,  then  gather  these 
little  lists  into  a  big  list  and  sort  this  list  for  relevancy. 

The  other  method  is  to  gather  all  results  for  a  particular 
word  together  in  a  big  list  beforehand.  Then  when  a 
query  arrives,  go  to  the  appropriate  machine,  get  the  list, 
and  then  sort  for  relevancy.  Without  showing  my  bias  too 
much,  look  on  the  bright  side:  for  rare  queries  or  obscure 
words,  these  are  equivalent. 
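
The first method (leave the little indices where they are and ask every machine) can be sketched with a thread pool. Everything here is an assumption for illustration: each "machine" is a plain callable that takes the query and returns a list of `(score, url)` pairs, and late or failed machines are simply ignored.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor, wait


def scatter_gather(shards, query, timeout_s=0.2, top_k=10):
    """Ask every shard, wait up to timeout_s, merge whatever came back."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(shards)))
    futures = [pool.submit(shard, query) for shard in shards]
    done, _late = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on shards that missed the deadline

    hits = []
    for future in done:
        if future.exception() is None:  # skip shards that raised
            hits.extend(future.result())
    return heapq.nlargest(top_k, hits)  # best scores first
```

The second method replaces the fan-out with a single lookup on the machine that owns the word, trading query-time waiting for the up-front work of gathering each word's list in one place.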


NO  ROOM  FOR  ERROR 

When  you  look  at  all  these  steps  and  all  the  complications, 
this  process  is  rife  with  things  that  can  go  wrong. 
The  hardest  part  about  writing  a  search  engine  is  that 
you're  going  to  process  billions  of  URLs  and  serve  millions, 
if  not  billions,  of  queries.  This  does  not  leave  a  lot 
of  room  for  error.  One  super-linear  algorithm  applied  over 
the  wrong-sized  list  of  items  and  you  are  sunk.  One  lock 
inside  another  lock  and  you  are  sunk.  There  will  be  no 
code  paths  not  explored.  All  of  those  comments  in  your 
code,  which  print  out  errors  like  "This  will  never  happen," 
will  happen. 

When  you  think  that  you  are  done,  there  is  still  the 
load  balancing,  the  caching,  the  DNS  servers,  the  ad 
service,  the  image  servers,  the  update  architecture,  and 
(to  take  off  on  a  familiar  tune)  a  cartridge  in  a  tape  drive. 
Oh,  and  if  you  would  like  to  hear  from  someone  who's 
already  done  it,  read  Mike  Cafarella  and  Doug  Cutting's 
article,  "Nutch:  Open  Source  Web  Search,"  on  page  54  of 
this  issue. 

Sadly,  the  biggest  thing  that  goes  wrong  while  writing 
your  own  search  engine  is  running  out  of  time.  Real  life 
often  interferes  and  forces  you  to  end  your  quest.  In  that 
case,  cheer  up;  once  the  search  bug  gets  you,  you'll  be 
back.  The  problem  isn't  getting  any  easier,  and  it  needs  all 
the  experience  anyone  can  muster. 
ANNA  PATTERSON  (anna@cs.stanford.edu)  has  written 
two  search  engines.  Most  recently  she  wrote  the  biggest 
index  in  the  world  by  indexing  30  billion  Web  pages 
at  the  Internet  Archive  at  Recall.Archive.org.  In  1998 
she  coauthored  a  search  engine  at  Xift,  where  she  was 
a  founder.  She  received  her  Ph.D.  in  computer  science 
from  the  University  of  Illinois  at  Urbana-Champaign  and 
was  a  research  scientist  at  Stanford  University,  where 
she  worked  on  phenomenal  data  mining.  She  is  also  the 
mother  of  three  preschoolers,  who  let  her  hack  sometimes. 

Search  engines  are  as  critical  to  Internet  use  as  any  other 
part  of  the  network  infrastructure,  but  they  differ  from 
other  components  in  two  important  ways.  First,  their 
internal  workings  are  secret,  unlike,  say,  the  workings  of 
the  DNS  (domain  name  system).  Second,  they  hold  political 
and  cultural  power,  as  users  increasingly  rely  on  them 
to  navigate  online  content. 

When  so  many  rely  on  services  whose  internals  are 
closely  guarded,  the  possibilities  for  honest  mistakes, 
let  alone  abuse,  are  worrisome.  Further,  keeping  search- 
engine  algorithms  secret  means  that  further  advances 
in  the  area  become  less  likely.  Much  relevant  research  is 
kept  behind  corporate  walls,  and  useful  methods  remain 
largely  unknown. 

To  address  these  problems,  we  started  the  Nutch 
software  project,  an  open  source  search  engine  free  for 
anyone  to  download,  modify,  and  run,  either  as  an 
internal  intranet  search  engine  or  as  a  public  Web  search 
service.  As  you  may  have  just  read  in  Anna  Patterson's 
"Why  Writing  Your  Own  Search  Engine  Is  Hard,"  writing 
a  search  engine  is  not  easy.  As  such,  our  article  focuses  on 
Nutch's  technical  challenges,  but  of  course  we  hope  Nutch 
will  offer  improvements  in  both  the  technical  and  social 
spheres.  By  enabling  more  people  to  run  search  engines, 
and  by  making  the  code  open,  we  hope  search  algorithms 
will  become  as  transparent  as  their  importance  demands. 

TECHNICAL  CHALLENGES 

Much  of  the  challenge  in  designing  a  search  engine  is 
making  it  scale.  Writing  a  Web  crawler  that  can  download 
a  handful  of  pages  is  straightforward,  but  writing  one  that 
can  regularly  download  the  Web's  nearly  5  billion  pages  is 
much  harder. 

Further,  a  search  engine  must  be  able  to  process 
queries  efficiently.  Requirements  vary  widely  with  site 
popularity:  a  search  engine  may  receive  anywhere  from 
less  than  one  to  hundreds  of  searches  per  second. 

Finally,  unlike  many  software  projects,  search  engines 
can  have  high  ongoing  costs.  They  may  require  lots  of 
hardware  that  consumes  lots  of  Internet  bandwidth  and 
electricity.  We  discuss  deployment  costs  in  more  detail  in 
the  next  section,  but  for  now  it's  helpful  to  keep  in  mind 
a  few  ideas: 

•  The  cost  of  one  part  of  the  search  engine  scales  with  the 
size  of  the  document  collection.  The  collection  might 
be  very  small  when  Nutch  is  searching  a  single  intranet, 
but  could  be  as  large  as  the  Web  itself. 

•  Another  part  of  the  search  engine  scales  with  the  size 
of  the  query  load.  Each  query  takes  a  certain  amount  of 
time  to  process  and  consumes  some  bandwidth. 

•  With  these  two  factors  in  mind,  we've  designed  a  system 
that  can  easily  distribute  the  work  of  both  fetching 
and  query  processing  over  a  set  of  standard  machines. 

Figure  1  shows  the  system's  components. 

WebDB.  WebDB  is  a  persistent  custom  database  that 
tracks  every  known  page  and  relevant  link.  It  maintains 
a  small  set  of  facts  about  each,  such  as  the  last-crawled 
date.  WebDB  is  meant  to  exist  for  a  long  time,  across 
many  months  of  operation. 

Since  WebDB  knows  when  each  link  was  last  fetched, 
it  can  easily  generate  a  set  of  fetchlists.  These  lists  contain 
every  URL  we're  interested  in  downloading.  WebDB 
splits  the  overall  workload  into  several  lists,  one  for  each 
fetcher  process.  URLs  are  distributed  almost  randomly; 
all  the  links  for  a  single  domain  are  fetched  by  the  same 
process,  so  it  can  obey  politeness  constraints. 
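
The partitioning rule can be sketched in a few lines. This is a minimal illustration, not Nutch's actual code: the function name `build_fetchlists` is hypothetical, and hashing the host name with CRC32 is simply one assumed way to get a near-random spread while keeping each host's URLs in a single list.

```python
from urllib.parse import urlsplit
from zlib import crc32


def build_fetchlists(urls, num_fetchers):
    """Split URLs into one list per fetcher, keeping each host together.

    Hypothetical helper: hashing the host name spreads hosts roughly
    evenly while guaranteeing that every URL for a given host lands
    in the same list.
    """
    fetchlists = [[] for _ in range(num_fetchers)]
    for url in urls:
        host = urlsplit(url).hostname or ""
        index = crc32(host.encode("utf-8")) % num_fetchers
        fetchlists[index].append(url)
    return fetchlists
```

Because the list index depends only on the host, a single fetcher sees all requests bound for a given site and can pace them on its own, without coordinating with the other fetchers.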

The fetchers consume the fetchlists and start downloading
from the Internet. The fetchers are "polite," meaning
they don't overload a single site with requests, and
they observe the Robots Exclusion Protocol. (This allows
Web-site owners to mark parts of the site as off-limits to
automated clients such as our fetcher.) Otherwise, the
fetcher blindly marches down the fetchlist, writing down
the resulting downloaded text.
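
The two politeness rules, honoring robots.txt and spacing out requests to the same host, might look roughly like this. It is a sketch under stated assumptions rather than the Nutch fetcher: the class `PolitenessGate`, the fixed per-host delay, and the choice to treat a host with no known robots.txt as unrestricted are all illustrative. Only the robots.txt parsing uses a real library, Python's standard `urllib.robotparser`.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Decides whether, and when, a URL may be fetched (illustrative only)."""

    def __init__(self, user_agent, min_delay_seconds=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.robots = {}        # host -> RobotFileParser
        self.last_fetch = {}    # host -> time of the most recent fetch

    def load_robots(self, host, robots_txt):
        """Register the robots.txt body already downloaded for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, url):
        """True unless the host's robots.txt excludes this URL.

        Assumption: a host with no registered robots.txt is unrestricted.
        """
        parser = self.robots.get(urlsplit(url).hostname)
        return parser is None or parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url):
        """How long to pause before hitting this URL's host again."""
        last = self.last_fetch.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_fetch(self, url):
        """Note that this URL's host was just contacted."""
        self.last_fetch[urlsplit(url).hostname] = self.clock()
```

A fetcher loop would call `allowed` before each download, sleep for `seconds_to_wait`, and call `record_fetch` afterward. The injectable `clock` exists only so the pacing logic can be exercised without real delays.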

Fetchers  output  WebDB  updates  and  Web  content.  The 
updates  tell  WebDB  about  pages  that  have  appeared  or 
disappeared  since  the  last  fetch  attempt.  The  Web  content 
is  used  to  generate  the  searchable  index  that  users  will 
actually  query. 

Note  that  the  WebDB-fetch  cycle  is  designed  to  repeat 
forever,  maintaining  an  up-to-date  image  of  the  Web 
graph.

Indexing and Querying. Once we have the Web content,
Nutch can get ready to process queries. The indexer
uses  the  content  to  generate  an  inverted  index  of  all 
terms  and  all  pages.  We  divide  the  document  set  into  a  set 
of  index  segments,  each  of  which  is  fed  to  a  single  searcher 
process. 
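
For readers unfamiliar with the term, an inverted index maps each term to the pages that contain it. The sketch below is a toy version under our own assumptions: the tokenizer, the round-robin split into segments, and the function names are illustrative, and Nutch's real index format is considerably richer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each term to {url: number of occurrences in that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] = index[term].get(url, 0) + 1
    return dict(index)


def split_into_segments(pages, num_segments):
    """Deal pages round-robin into segments, then index each one."""
    buckets = [{} for _ in range(num_segments)]
    for i, (url, text) in enumerate(sorted(pages.items())):
        buckets[i % num_segments][url] = text
    return [build_inverted_index(bucket) for bucket in buckets]
```

Each segment is a complete, independent index over its share of the pages, which is what lets a separate searcher process own it.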

We can thus distribute the current set of index segments
over an arbitrary number of searcher processes,
allowing us to scale easily with the query load. Further,
we can copy an index segment to multiple machines
and run a searcher over each one; that further improves
scaling and provides reliability in case one or more of the
searcher machines fail.

Each  searcher  also  draws  upon  the  Web  content  from 
earlier,  so  it  can  provide  a  cached  copy  of  any  Web  page. 

Finally, a pool of Web servers handles interactions with
users and contacts the searchers for results. Each Web
server  interacts  with  many  different  searchers  to  learn 
about  the  entire  document  set.  In  this  way,  the  Web 
server  is  simultaneously  acting  as  an  HTTP  server  and  a 
Nutch-search  client. 

Web  servers  contain  very  little  state  and  can  be  easily 
reproduced  to  handle  increased  load.  They  need  to  be 
told  only  about  the  existing  pool  of  searcher  machines. 
The  only  state  they  do  maintain  is  a  list  of  which  searcher 
processes  are  available  at  any  time;  if  a  given  segment's 
searcher  fails,  the  Web  server  will  query  a  different  one 
instead. 
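
The Web server's role as a search client can be sketched as a fan-out, failover, and merge. Everything named here is an assumption made for illustration: searchers are modeled as plain callables that return scored hits, `SearcherDown` stands in for whatever failure a real network client would report, and a segment whose replicas are all down is silently skipped.

```python
import heapq


class SearcherDown(Exception):
    """Raised by a searcher stub that cannot be reached (hypothetical)."""


def search_all_segments(query, replicas_by_segment, top_k=10):
    """Query one live searcher per segment and merge the best hits.

    replicas_by_segment maps a segment name to a list of callables;
    each callable takes the query and returns [(score, url), ...].
    """
    hits = []
    for segment, replicas in replicas_by_segment.items():
        for searcher in replicas:
            try:
                hits.extend(searcher(query))
                break  # this segment answered; skip its other replicas
            except SearcherDown:
                continue  # try the next replica of the same segment
    return heapq.nlargest(top_k, hits)
```

The only state the caller needs is the mapping from segments to replicas, which matches the description of Web servers as nearly stateless.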

Quality.  Generating  high-quality  results,  of  course,  is  the 
most important barrier for Nutch to overcome. If it cannot
find relevant pages as well as commercial engines do,
Nutch  isn't  much  use.  But  how  can  it  ever  compete  with 
large,  paid  engineering  staffs? 

•  First,  we  believe  high-quality  search  is  a  slowing  target. 
By  some  measures  of  quality,  the  gap  between  the  best 
search engine and its competitors has narrowed considerably.
After several years of intense focus on search
results,  anecdotal  evidence  suggests  gains  in  quality  are 
harder  to  find.  The  everyday  search  user  will  find  lots  of 
new  features  on  the  various  engines,  but  real  differences 
in  results  quality  are  close  to  imperceptible. 

•  Second,  although  much  search  work  takes  place 
behind  corporate  walls,  there  is  still  a  fair  amount  of 
public  academic  work.  Many  of  the  techniques  that 
search  engines  use  were  discovered  by  IR  (information 
retrieval)  researchers  in  the  1970s.  Some  people  have 
tried to tie IR in with advances in language understanding.
With the advent of the Web, many different groups
experimented  with  link-driven  methods.  We  think  there 
should  be  more  public  research,  but  there  is  already  a 
good amount to draw upon.

• Third, we expect that Nutch will be able to incorporate
academic  advances  faster  than  any  other  engine  can. 
We  think  researchers  and  engineers  will  find  Nutch 
very  appealing.  If  it  becomes  the  easiest  platform  for 
researchers  to  experiment  on,  taking  advantage  of  the 
results  should  be  extremely  simple. 

•  Finally,  we'll  rely  on  the  traditional  advantages  of  open 
source  projects.  More  people  from  more  places  should 
work  on  Nutch,  which  means  faster  bug  finding,  more 
ideas,  and  better  implementations.  In  the  long  term,  a 
worldwide shared effort supported by research at a number
of institutions should eventually be able to surpass
the  private  efforts  of  any  company. 

Once an open source search solution is as good as or
better than proprietary implementations, there should be
little  reason  for  companies  to  use  anything  but  the  open 
source  version.  It  will  be  cheaper  to  maintain  and  work  as 
well. 

Spam.  A  high  search  ranking  can  be  extremely  valuable 
to  a  Web-site  owner — so  valuable  that  many  sites  try  to 
"spam"  search  engines  with  specially  formulated  content 
in  an  effort  to  raise  their  rankings.  As  with  e-mail  spam, 
the  spammer  can  benefit  at  a  heavy  cost  to  everyday  users. 

How  does  this  work  in  practice?  Search  engines  tend 
to  use  a  well-known  set  of  guidelines  to  measure  a  page's 
relevance  to  a  given  query.  For  example,  all  other  things 
being  equal,  a  page  that  contains  the  word  parrot  10  times 
is  more  about  parrots  than  a  page  that  has  the  word  just 
once.  A  page  with  lots  of  incoming  links  from  other  sites 
is  more  important  than  a  page  with  fewer  incoming  links. 

That means it can be fairly easy to trick a naive search
engine. Want to make sure every parrot lover finds your
page?  Repeat  the  word  parrot  600  times  somewhere  on 
your  page.  Want  to  raise  your  page's  in-link  count?  Pay  a 
type  of  site  known  as  a  "link  farm"  to  add  thousands  of 
links  aimed  at  your  page. 

Of  course,  the  consequence  is  that  search  results  can 
become  choked  with  sites  that  are  not  truly  relevant, 
but  have  "gamed"  the  system  successfully.  Good  search 
engines  don't  want  their  results  to  become  useless,  so 
they  do  everything  possible  to  detect  these  spam  tricks. 
Spammers,  in  turn,  modify  their  tricks  to  avoid  detection. 
The  result  is  an  arms  race  between  search  engine  and 
spammer. 

Here  are  some  well-known  spam  techniques,  along 
with  methods  to  defeat  them: 

• Web sites write documents that contain long repetitions
of certain words. Search engines counter by eliminating
terms that appear consecutively more than a certain
number of times.

• Web sites do the same trick, but intersperse the repeated
term with good-looking intervening text. Search
engines counter by checking whether the statistical distribution
of the words in the document matches the typical
English-language profile. If it's too far afield, the site is marked
as a spammer.

•  Web  sites  that  want  high  rankings  regardless  of  query 
put  spurious  "invisible"  text  on  the  page.  Say  the  site 
offers  a  page  about  electronics,  all  rendered  on  a  white 
background.  The  very  same  page  might  contain  a  long 
essay  about,  say,  Britney  Spears,  all  rendered  in  white 
text.  Users  won't  see  it,  but  the  search  engine  will. 

Search engines counter by computing the visible portion
of the HTML and tossing the rest, or even by penalizing
pages that use any invisible text.

• Web sites use the "User-Agent" header to identify the
type of browser. If the browser is a piece of desktop
software, the Web site returns regular content.
If the browser is a crawler for a search engine, the
Web site returns different content that contains
thousands of repetitions of parrot. Search engines fight
against this by penalizing sites that give substantially
different content for different browser types.

•  Web  sites  use  link  farms  to  add  to  incoming  link  count. 
Search  engines  find  link  farms  by  looking  for  statistically 
unusual  link  structures.  The  link  farms  are  thrown  away 
before  computing  link  counts.  Pages  that  participate  in  the 
farm  may  also  be  penalized. 
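
Two of the countermeasures listed above are simple enough to sketch. The code below is illustrative only: the function names, the run-length cap of three, and the 0.5 overlap threshold are assumptions chosen for the example, and Jaccard similarity of term sets is our stand-in for "substantially different content," not a description of what any particular engine does.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cap_consecutive_repeats(terms, max_run=3):
    """Drop occurrences of a term beyond max_run in an unbroken run."""
    kept, run_term, run_len = [], None, 0
    for term in terms:
        run_len = run_len + 1 if term == run_term else 1
        run_term = term
        if run_len <= max_run:
            kept.append(term)
    return kept


def looks_cloaked(browser_text, crawler_text, min_overlap=0.5):
    """Flag a page whose crawler version shares few terms with the browser version.

    Uses Jaccard similarity of the two term sets; the 0.5 threshold is an
    arbitrary value chosen for illustration.
    """
    a, b = set(tokenize(browser_text)), set(tokenize(crawler_text))
    if not a and not b:
        return False
    return len(a & b) / len(a | b) < min_overlap
```

An indexer could run `cap_consecutive_repeats` on each page's terms before counting them, and a crawler could fetch a sample of pages under two different User-Agent strings and pass both bodies to `looks_cloaked`.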

Some of these methods may rely on secrecy for their
effectiveness,  so  some  people  ask  how  an  open  source 
engine  could  possibly  handle  spam.  With  full  disclosure 
of  code,  won't  a  search  engine  lose  the  fight? 

It's  true  that  Nutch  code  won't  hold  any  secrets.  But 
these secrets are brittle anyway: spammers don't take
long  to  defeat  the  latest  defense.  If  search  has  to  rely  on 
secrecy  to  beat  spam,  the  spammers  will  probably  win. 

In the world of e-mail spam, at least, the days of simple
methods to defeat spammers seem to be over. Many
of the latest techniques to defeat e-mail spam are statistics-driven.
With such methods, even intimate knowledge of
the source code may not help spammers much. Although
people may be reluctant to use such probabilistic spam
detectors on e-mail for fear of deleting a single good message,
the massive redundancy of Web information means
false positives are not so great a tragedy.

Alternatively,  the  answer  may  lie  in  an  analogy  to 
cryptography.  It  has  taken  a  long  time  for  people  to  learn 
the counterintuitive notion that the most secure cryptographic
systems are those that have the most public
scrutiny.  Most  people  who  look  at  these  systems  are 
well  motivated  and  work  to  improve  them  rather  than 
to  defeat  them.  They  find  problems  before  they  can  be 
exploited. 

The  analogy  may  be  flawed,  but  it  can't  be  tested 
without  transparency.  Nutch  is  currently  the  best  shot  at 
enabling  some  form  of  public  review  for  defeating  search 
engine  spam. 

DEPLOYMENTS/OPERATIONS 

Scalability/Cost  Effectiveness.  One  objection  the  Nutch 
project  often  hears  is  that  search  is  simply  too  resource- 
hungry  to  be  a  good  open  source  project.  In  fact,  a  Web 
search  engine  can  be  operated  for  fairly  modest  sums. 

A  note  on  index  size:  Web  search  engines  make  claims 
about  the  sizes  of  their  indexes,  but  these  are  not  directly 
comparable. Some count the number of pages they've
fetched; some count the number of URLs that may be
returned, even though they've not been fetched but only
referenced by another page. Also, many pages are duplicates:
a given site may respond to more than one host
name, giving all of its pages multiple URLs. And although
bigger is almost always better, it may not be much better.
An index with just 100 million pages can perhaps satisfy
99 percent of users' searches as well as a 5-billion-page
index. So if you are primarily interested in cost-effective
usability, you might build only a 100-million-page
system. But if you are interested in bragging rights and
satisfying  rare,  obscure  searches,  then  a  larger  index  may 
be  justified. 

Here  we  will  outline  Nutch's  operational  costs.  All 
figures  are  meant  to  be  illustrative,  since  the  performance 
and  cost  of  hardware,  software,  and  bandwidth  are  all 
changing. 

Nutch deployments use two classes of machines: back-end
machines, for crawling, database, link analysis, and
indexing  tasks;  and  front-end  machines,  which  perform 
searches  and  serve  search  results. 

A  typical  back-end  machine  is  a  single-processor  box 
with  1  gigabyte  of  RAM,  a  RAID  controller,  and  eight  hard 
drives. The filesystem is mirrored (RAID level 1) and provides
1 terabyte of reliable storage. Such a machine can be
assembled  for  a  cost  of  about  $3,000. 

One  such  back-end  machine  is  required  for  every  100 
million  pages.  Thus,  to  maintain  an  index  of  1  billion 
pages  requires  10  back-end  machines,  or  about  $30,000  in 
hardware. 

A  typical  front-end  machine  is  a  single-processor  box 
with  4  gigabytes  of  RAM  and  a  single  hard  drive.  Such  a 
machine  can  be  assembled  for  about  $1,000. 

The  query-handling  capacity  of  front-end  machines 
varies,  depending  on  how  much  each  must  search.  For 
example, if each front-end machine is given 25 million
pages to search, then each can perform about two
searches per second. Thus, a 100-million-page index could
be searched with four front-end machines ($4,000) while
a 1-billion-page index requires 40 front-end machines
($40,000), but such configurations could still handle only
two searches per second. In this case, access to a disk-resident
index is the primary bottleneck.

Query  handling  is  more  cost  effective  when  primary 
index  structures  fit  within  RAM.  In  particular,  if  each 
front-end machine is required to handle only 2 million
pages, then each can handle perhaps 50 searches
per second. In this configuration a 100-million-page
index would require 50 front-end machines ($50,000)
and a 1-billion-page index would require 500 machines
($500,000). This is half the cost per query of the first case.
Here  the  bottleneck  is  primarily  the  CPU.  Further  search 
software  optimizations  can  make  this  configuration  even 
more  cost  effective. 
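
The "half the cost per query" comparison can be checked with a line of arithmetic for the 100-million-page index. The helper below is our own framing of the figures quoted above (front-end hardware dollars divided by sustained searches per second); the function name is hypothetical.

```python
def dollars_per_search_per_second(front_end_machines, searches_per_second,
                                  machine_cost=1_000):
    """Front-end hardware cost divided by sustained query throughput."""
    return front_end_machines * machine_cost / searches_per_second


disk_bound = dollars_per_search_per_second(4, 2)     # 25M pages per machine
ram_bound = dollars_per_search_per_second(50, 50)    # 2M pages per machine
```

With these inputs the disk-resident configuration costs $2,000 of front-end hardware per search per second and the RAM-resident configuration $1,000.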

Note  that  as  traffic  increases,  front-end  hardware 
quickly  becomes  the  dominant  hardware  cost. 

Thus  far  we  have  discussed  only  the  raw  hardware 
costs.  In  addition,  there  are  hosting  costs.  These  are 
primarily  electricity  (as  consumed  both  directly  by  the 
hardware  and  by  the  air  conditioning  required  to  cool 
the  hardware),  bandwidth,  and  others  (racks,  network 
equipment,  facility  rental,  etc.).  Electricity  dominates 
these  costs,  and  together,  these  costs  easily  dominate  raw 
hardware  costs.  For  example,  you  might  amortize  the 
cost  of  hardware  over  three  years,  so  that  $100,000  of 
hardware  is  less  than  $3,000  per  month;  but  power,  space, 
and  bandwidth  for  100  machines  can  easily  cost  more 
than  that.  Since  hosting  costs  are  even  more  variable  than 
hardware  prices,  let's  just  assume  that  hosting  costs  are 
approximately the same as three-year amortized hardware
costs. Thus, a complete system might cost anywhere
between $800 per month for two-search-per-second performance
over 100 million pages, to $30,000 per month
for 50-search-per-second performance over 1 billion pages.
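
The cost rules in this section can be combined into a small calculator. This is a sketch of the stated assumptions and nothing more: one $3,000 back-end machine per 100 million pages, $1,000 front-end machines, three-year amortization, and hosting assumed equal to amortized hardware. The function and constant names are ours.

```python
import math

BACK_END_COST = 3_000          # dollars per back-end machine
FRONT_END_COST = 1_000         # dollars per front-end machine
PAGES_PER_BACK_END = 100_000_000
AMORTIZATION_MONTHS = 36       # three-year amortization


def monthly_cost(total_pages, pages_per_front_end):
    """Estimated dollars per month, hosting assumed equal to hardware."""
    back_ends = math.ceil(total_pages / PAGES_PER_BACK_END)
    front_ends = math.ceil(total_pages / pages_per_front_end)
    hardware = back_ends * BACK_END_COST + front_ends * FRONT_END_COST
    amortized = hardware / AMORTIZATION_MONTHS
    return 2 * amortized  # hardware plus an equal hosting charge


ram_resident = monthly_cost(1_000_000_000, 2_000_000)    # 50 searches/s
disk_resident = monthly_cost(100_000_000, 25_000_000)    # 2 searches/s
```

Under these assumptions the 1-billion-page, RAM-resident configuration comes to roughly $29,400 per month, in line with the $30,000 figure. The small disk-resident configuration comes to just under $400, below the $800 quoted, so that figure presumably includes costs this simple model leaves out.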

A  note  on  bandwidth  consumption:  If  we  assume  that 
Web  pages  average  around  10  kilobytes,  and  each  must 
be  re-fetched  monthly,  then  fetching  1  billion  pages  per 
month  requires  around  40  Mbps  (megabits  per  second) 
inbound. Bandwidth costs are typically symmetric, so

of money to operate, many Web search engines may not
serve  nearly  as  much  traffic  and  need  not  search  nearly 
so  many  pages.  In  a  world  with  lots  of  deployed  search 
engines,  the  vast  majority  will  serve  small  audiences. 
The  costs  are  also  well  within  reach  for  research  groups, 
governmental  departments,  and  small-  to  medium-size 
companies. 

One much trickier source of cost savings is automating
most system administration tasks. We believe there is
a lot of ground to be gained here, and Nutch has not yet
started. It's not clear how to use the open source programming
style for something that's so tied to the deployment,
but we need to do it.

Who  Should  Run  Nutch-Based  Web  Search  Engine(s)? 
Nutch.org is dedicated to making the Nutch software better
for everyone. That might mean running a small demo
site or making a search service available for academic
research, but we do not intend to run a destination search
site. Running such a service would put Nutch in competition
with its users. Instead, we hope that primarily other
institutions will run the Nutch software.

Governments,  universities,  and  nonprofits  are  terrific 
candidates  for  Nutch.  These  organizations  often  have 
special  obligations  that  for-profit  companies  don't  (e.g., 
a  seniors'  organization  might  want  to  offer  search  with 
a  special  usability  focus),  so  having  the  source  code  to 
Nutch  is  a  huge  advantage.  Further,  these  groups  often 
don't  have  lots  of  cash  to  spend  on  solutions. 

We  don't  have  great  data  yet  on  who  is  running  Nutch. 
As  far  as  we  can  tell,  the  most  active  Nutch  users  are 
universities  and  academic  research  groups.  Some  are  using 
Nutch  as  part  of  a  class,  and  some  are  using  it  because 
their  research  depends  on  access  to  indexed  pages  that 
they can control. Others are pulling apart the system,
taking elements that seem useful. It's too early to expect
any updates back from researchers, but we hope this is
coming soon.

One type of nonprofit in particular that we hope to see
is a PSE (public search engine), a search site that is as
usable as any commercial one, but that operates without
advertising or commercial engagement. These engines will
help make good on Nutch's
promise to make search results more transparent to users.
Conversely,  they  will  make  for-profit  engines  easier  to 
spot  if  they  adjust  rankings  for  commercial  gain. 

A  PSE  might  get  its  funds  through  donations  from 
users, corporations, or foundations, just as public broadcasting
channels do. It's worth noting that PSEs do not
need  to  process  a  huge  percentage  of  search  queries  to  be 
successful.  Their  existence  will  help  ensure  that  search 
users  always  have  a  good  alternative  (one  that  doesn't 
exist  today). 

What  about  for-profit  corporations?  We  think  lots  of 
companies will want to run small search engines for in-house
use or on their public Web sites. For most of these
companies,  search  will  be  just  another  item  they  have  to 
take  care  of,  not  their  main  focus. 

Nutch  should  also  enable  small  search-technology 
companies  to  be  more  creative,  just  as  other  open  source 
projects  have  enlarged  what  small  teams  can  accomplish. 

We  hope  that  Nutch,  by  providing  free,  open  source 
Web search software, will help both to promote transparency
in Web search and to advance public knowledge of
Web-search algorithms.

MIKE CAFARELLA worked as a software engineer at
Silicon  Valley  startups  Marimba  Corporation  and  Tellme 
Networks  from  1998  to  2002.  In  2002  he  began  work  on 
the  Nutch  project,  which  he  continues.  In  2003  he  started 
a  Ph.D.  program  in  computer  science  at  the  University 
of  Washington.  He  graduated  from  Brown  University  in 
1996. In 1997, he earned an M.A. from the University of
Edinburgh  while  on  a  Fulbright  scholarship  to  the  United 
Kingdom. 

DOUG  CUTTING  has  worked  on  search  technology  for 
more  than  15  years.  This  includes  five  years  at  Xerox 
PARC, three years at Apple with its Advanced Technology
Group, and more than four years at Excite. In 1998
he  wrote  Lucene  (http://jakarta.apache.org/lucene/), 
an  open  source  search  library  that  subsequently  became 
part  of  the  Apache  Jakarta  project.  In  2002  he  started 
Nutch  (http://www.nutch.org/),  an  open  source  Web 
search  application. 

File Sharing and Sales

To  the  Editor: 

While  the  Recording  Industry  Association  of  America  pursues  its  heavy-handed  offensive  against  music 
downloading  and  file  sharing  (Business  Day,  April  5),  other  owners  of  cultural  content  have  found  ways 
to  live  (and  flourish)  with  emerging  technologies. 

I  have  operated  a  small  family-owned  historical  film  archives  for  20  years.  Several  years  ago,  we  digitized 
the  most  sought-after  images  in  our  collection  and  placed  them  online  for  free  downloading  and  nearly 
unrestricted  reuse. 

Our  experience  may  seem  counterintuitive,  but  it  has  been  overwhelmingly  positive:  the  more  we  give 
away,  the  more  we  actually  sell. 

File  sharing  and  free  downloading  have  increased  the  ubiquity  and  prominence  of  our  collection  and  have 
given  it  ample  publicity  at  very  little  cost,  resulting  in  increased  income. 

Might  there  be  a  lesson  here  for  the  music  industry? 

RICK  PRELINGER 

San  Francisco,  April  5,  2004 

Google's New 1 Gigabyte Webmail Service
Everybody  Plays  the  Fool 

by  Richard  W.  Wiggins 

April  5,  2004  —  On  April  1,  Google,  Inc.  announced  a  new  Webmail  service  that  will  provide  each  subscriber 
with  a  free  1  gigabyte  mailbox.  The  new  service,  called  Gmail,  raises  the  bar  for  free  e-mail  space  by  a 
stunning factor: 100 times the space that Hotmail and Yahoo! provide. Gmail will include advertisements
targeted  to  the  content  of  subscribers'  mail.  Google's  announcement  shook  the  Webmail  industry.  A 
whimsical  press  release  and  the  announcement's  April  1  timing  also  generated  massive  doubts  as  to  its 
authenticity.  Privacy  advocates  also  raised  concerns  about  the  targeted  ads. 

The  Gmail  tagline  is,  "Search,  don't  sort."  Google  says:  "Gmail  uses  Google  search  technology  to  find 
messages  so  users  don't  have  to  create  folders  and  file  their  individual  e-mails."  A  Webmail  service  offering  a 
gigabyte  of  storage  could  let  you  keep  years  of  important  e-mails  in  one  place.  Combine  that  with  the  power 
of  Google  searches,  and  you've  finally  got  an  efficient  way  to  find  that  important  contract  negotiation  from  3 
years ago. Millions might flock to such a service, if it's real.

The  Media  Is  Skeptical 

Google  released  news  of  the  service  to  the  media  on  March  31.  John  Markoff  covered  the  story  for  the  April  1 
edition  of  The  New  York  Times,  and  many  media  outlets  began  carrying  the  story  overnight. 

As  news  of  Google's  stunning  announcement  spread  on  April  1,  many  in  the  media  and  the  public  assumed 
those playful lads in Mountain View were enjoying another April Fool's Day. Internet discussion boards were
abuzz  with  speculation.  Google  co-founders  Larry  Page  and  Sergey  Brin  wisecracked  in  the  press  release, 
claiming  they  launched  Gmail  to  help  out  a  single  individual  who'd  complained  that  managing  e-mail  is 
chaotic. 

Reporter  Mike  Musgrove  covered  the  story  for  The  Washington  Post,  and  he  describes  how  Google's  April  1 
shenanigans  caused  problems: 

My  editor  and  I  looked  at  the  loopy  way  the  press  release  was  written  and  figured  it  was  likely 
a  joke,  considering  the  date,  and  considering  that  Google  seems  to  like  to  have  a  sort  of  playful 
corporate  image.  When  I  talked  to  Google  this  morning,  half  of  my  time  was  wasted  by  asking 
for  their  reassurances  that  this  was  not,  in  fact,  a  joke.  And  I  went  through  this  again,  in 
reverse,  with  sources  for  the  story  who  weren't  sure  this  was  for  real  or  not. 

The  BBC  scolded  Google.  Under  a  headline,  "Timing  makes  Google  an  April  Fool,"  and  a  subhead  of,  "The  first 
rule  of  public  relations:  check  the  calendar  before  your  big  product  launch,"  the  BBC  wrote:  "Google's 
insistence  that  Gmail  was  genuine  was  undermined  by  the  oddly  jokey  style  of  its  initial  announcement." 
They  quoted  Google's  vice  president  of  products,  Jonathan  Rosenberg:  "It  is  April  Fool's  Day.  We  were  having 
fun  with  this  announcement. ...We  are  very  serious  about  Gmail." 

New  Webmail  Economics 

Brewster  Kahle  is  founder  of  the  Internet  Archive,  which  houses  hundreds  of  terabytes  of  Web  history.  (A 
terabyte  is  1000  gigabytes.)  I  asked  Kahle  about  the  marginal  cost  of  one  gigabyte  of  disk.  Kahle  replied: 
"We  pay  $1.30  per  gigabyte  for  spinning  storage,  and  then  back  that  up  with  a  mirror.  This  makes  $2.60  per 
gigabyte." Kahle notes that those figures cover capital costs only and that the hardware should last 3 to 5
years.  He  observes  that  power  and  system  administration  also  factor  in,  but  with  extremely  large  data  farms, 
those  costs  are  relatively  insignificant. 

Tech  visionary  George  Gilder  told  me:  "Google  is  exploiting  the  key  abundances  of  the  era:  bandwidth  and 
storage, summed up in my model as 'Storewidth,' in order to supply what is scarce: just-in-time information.
Google  is  the  prime  Storewidth  company."  Gilder  calculates  the  cost  of  storage  at  about  $2.33  per  gigabyte 
per  year,  including  depreciation  and  maintenance.  But  he  thinks  Google  enjoys  other  advantages: 

Since  Google  must  sustain  these  costs  anyway  to  support  its  search  capacity,  advertising 
model,  and  news  services,  I  believe  that  their  marginal  cost  for  supplying  e-mail  is  close  to  zero 
when  the  increasing  volume  of  usage  of  all  services  is  considered.  Market  share  and  volume  are 
everything  in  these  front  loaded  Internet  services.  With  more  numbers  and  better  targeted 
advertising,  Google  will  make  out  like  bandits,  without  the  downside  of  encountering  Wyatt  Earp 
at the FTC corral.

Many  analysts  see  the  Gmail  announcement  as  a  challenge  to  Hotmail  and  Yahoo!  mail.  Certainly  those  two 
leading  Webmail  providers  must  choose  whether  to  match  Google's  new  threshold.  Perhaps  they  can  afford 
to;  Hotmail  has  the  deep  pockets  of  Microsoft  behind  it,  and  Yahoo!  remains  a  formidable  competitor. 

The  Gmail  announcement  may  have  more  immediate  effect  on  companies  like  USA.net,  which  markets 
Webmail  services  to  enterprises  and  individuals  under  the  Netaddress  brand.  Ironically,  on  March  31,  as 
Google  was  about  to  announce  Gmail,  Netaddress  offered  subscribers  a  special  deal:  10  megabytes  of 
storage  at  $30  per  year. 

Even  though  Google  is  new  to  the  Webmail  business,  its  position  is  strong.  It  builds  its  Webmail  service  on 
the dawn of an IPO expected to raise $20 billion to $25 billion. Just as Google is drawing advertising
revenues from competitors in Web space, its targeted e-mail advertising program could bring in substantial
revenues. Google can afford to serve millions of new customers as it builds its revenues, and it has the scale
and  skills  to  manage  massive  amounts  of  data  across  data  centers  worldwide. 

For  the  short  term,  companies  such  as  USA.net  and  Criticalpath.net  may  take  solace  in  their  hosted  e-mail 
services,  which  allow  smaller  companies  to  outsource  management  of  corporate  e-mail.  Fortune  500 
companies  are  unlikely  to  entrust  enterprise  e-mail  to  an  externally  hosted  service. 

Terabytes  of  Personal  Data  on  Google's  Server  Farms 

Gmail  could  attract  millions  of  subscribers.  The  notion  gives  some  people  pause.  The  result  could  be  the 
largest  migration  of  personal  information  in  history.  Karen  Coyle,  a  digital  library  specialist 
(http://www.kcoyle.net), remarked: "Doesn't this have the potential of being one of the greatest risks to
personal  privacy  in  recent  history?  Although  I  realize  that  there  are  backups  and  such,  my  ISP  does  not  store 
my  e-mail  on  a  permanent  basis,  and  as  far  as  I  know  there  is  no  online  access  to  my  mail  history.  And  other 
Web  mail  providers  have  disk  space  limits  that  encourage  you  to  download  or  delete  the  mail.  If  your  mail  is 
stored  at  Google,  and  you  have  an  incentive  to  keep  it  all  because  you've  got  the  1  gig  and  the  great  search 
capability..." 

Google's  vice  president  of  engineering,  Wayne  Rosing,  told  me:  "We  tried  to  craft  a  privacy  policy  that's  of 
the  highest  standards.  At  the  same  time  we  had  to  include  exceptions  necessary  to  run  the  service.  Every 
mail  service  I've  been  associated  with  has  to  have  the  ability  for  sysadmins  to  go  in  and  diagnose  problems 
with  mailboxes.  But  our  employees  sign  a  document  indicating  they  understand  how  important  privacy  is. 
They  understand  that  violating  our  privacy  policy  would  lead  to  termination." 

Mindshare 

Perhaps  Google  is  crazy  like  a  fox.  After  all,  every  article  that  questioned  the  announcement  probably  led  to  a 
follow-up  story  confirming  its  veracity.  In  one  day,  Google  generated  tremendous  buzz  for  Gmail— a  service 
whose  launch  date  we  still  don't  know.  At  this  point,  a  handful  of  users  are  testing  a  preview  version  of  the 
service. For more information, see http://gmail.google.com.

Do  not  doubt  Google's  ability  to  promote  its  new  service  in  other  ways.  Do  a  Google  search  for 
"Webmail"— the  first  sponsored  link  is  an  ad  for  Gmail. 

Richard W. Wiggins is an author and speaker who specializes in Internet topics. He is a senior information
technologist at the computer center at Michigan State University. His e-mail address is
rich@richardwiggins.com.
Social Security
Isn't Doomed

SOCIAL SECURITY REFORMERS ARE TOTALLY
bogged down. Whether that's good news or bad
news, I'm not sure. Most Republicans seek an
extreme makeover: diverting some of your payroll
tax into a private investment account. Most Democrats say
phooey, that's too costly and risky, and anyway the program
only needs some tweaks. A few rebels think that everyone is
crying wolf: Social Security may not need
any fixes at all.

The 2004 Social Security trustees' report,
due out this week, shows that full
benefits can be paid for nearly 40 years,
without skipping a beat. After that, payouts
could be reduced by 25 percent, but
Congress wouldn't let the program unravel
that far. Even if it did, the "loss" wouldn't
be as bad as it sounds. Real benefits (after
inflation) rise every year for new retirees.
Even with a 25 percent haircut, benefits in
2042 will be much higher than they are today,
says Dean Baker of the Center for Economic
and Policy Research and coauthor,
with Mark Weisbrot, of the book "Social
Security: The Phony Crisis."

Making guesses about the 2040s seems
nuts. It's done only because, by law, Social
Security's books have to balance (on paper)
over 75 years. Today's projections will be
wrong. We just don't know how much and
in which direction.

Take Social Security's sacred trust fund.
Back in 1994, the trustees expected it to
run dry in 2029. Since then, they've pushed
the dry-up date back to 2042. Did the program
magically improve while we slept?
No way. All that happened was that the
economy grew faster than projected.

Demographics alone aren't to blame for
the trust-fund gap. Social Security taxes
were raised and benefits shaved in 1983 to
pay for the future boomer wave. What
hasn't been covered is the higher lifetime incomes
we're getting from Social Security as
a result of retiring earlier and living so long.

The thought that Social Security is basically
sound may come as a shock. For years
all you've heard are scaremongers warning
of "looming insolvency" in a program they
love to call a Ponzi scheme.

But that's just propaganda, spread by
people determined to shake your faith in the
government's most popular program. Once
you believe that it won't be there when you
retire, you'll presumably vote for private accounts.
Believers want us to save for ourselves,
leaving Social Security to the poor.
The disinformation artists even claim
that Social Security costs too much to save.
But here's a factoid to tuck away: last year
the program's 75-year shortfall came to
$3.8 trillion. President George W. Bush's
2001 and 2003 tax cuts equal $9 trillion to
$12 trillion. If the tax cuts were pared by a
third and those revenues dedicated to Social
Security, the "crisis" disappears, says
Peter Orszag of the Brookings Institution.

Scaremongers warn of
'looming insolvency' and
claim that the system
costs too much to save.
Don't become a sucker
for private accounts.

But  Social  Security's  supporters  can't 
just  lie  back  in  their  lawn  chairs  and  say  it 
will  all  work  out.  Too  many  people  now 
think  it  won't.  Besides,  there  are  still  those 
longer  lifespans  to  pay  for,  and  the  sooner 
we  start,  the  better.  The  program's  backers 
are  proposing  various  tweaks. 

Orszag and Peter Diamond, of the Massachusetts
Institute of Technology, offer
some ideas in their new book, "Saving Social
Security: A Balanced Approach." They
wouldn't depend on the luck of the stock
market, as private investment accounts
would. Nor would they drive up the deficit
(yay!). They would simply slow the growth
of future benefits and raise the payroll tax a
bit, mainly on upper earners. The authors
would also freeze the estate tax in 2009,
when the exemption reaches $3.5 million
per person. Larger estates (about 10,000 of
them a year) could be taxed with the proceeds
added to Social Security's trust fund.

Robert Ball, a former Social Security
commissioner, would plant a small tax in
the legislation that clicks in automatically
when the trust fund begins to shrink. "That
should avoid periodic false crises about Social
Security's finances," he says.

The plans from conservatives all include
private accounts. Most of the
blueprints I've seen are laughably misleading.
But one put forth by Reps. Jim Kolbe
(a Republican) and Charles Stenholm (a
Democrat) offers a possible framework to
build on. They'd divert 2 or 3 percent of
your Social Security contribution into a
private account invested in indexed stock
or bond funds. Once your account reaches
$7,500, you could switch to funds run by
money managers. Over time, your Social
Security benefits would drop. You hope
your investment will grow by enough to replace
(or outpace) the benefits you lost.

It's not cheap to switch to private accounts.
All the people owed Social Security
checks will have to be paid while you're using
some of the tax money to invest for
yourself. Kolbe and Stenholm would cover
the cost with higher taxes on upper earners,
cuts in benefits and cost-of-living raises
and, yes, more government borrowing.

A Republican sweep in November
might put Social Security back on the
table. But the parties are too far apart to
make any significant changes. The Republicans
want private accounts and, Kolbe
says, "we'll be at loggerheads as long as the
Democrats reject them." Luckily, Social Security
is strong enough to soldier on.

Reporter associate: TEMMA EHRENFELD
In six short years, two Stanford grad students turned a
simple idea into a multibillion-dollar phenomenon and
changed our lives. Now competitors are searching for a
way to dethrone the latest princes of the Net.


BY  STEVEN  LEVY 


HORT  OF  "YOU'RE  UNDER 


Lgs  that  the  leaders  of  a 
oung  technology  company 
^ould  like  less  to  hear  than 
Bill  Gates  thinks  you've 
utt  and  now  he  wants  your 

gey  Brin  and  Larry 

eem  ruffled  at  all.  Hanging 
in  their  spacious  new 


even  confident,  in 
disorganization to build a substantial
company out of one unquestionably
brilliant idea.

Let's face it: it's good to be
Google. Every minute, worldwide,
in 90 languages, the index of this
Internet-based search engine created
by these Stanford doctoral
dropouts is probed more than
138,000 times. In the course of a
day, that's over 200 million searches
of 6 billion Web pages, images
and discussion-group postings.
Searches for golf clubs, song lyrics,
tomorrow night's blind date,
recipes and the unaltered screen
shots of Janet Jackson's Super Bowl
boo-boo. Amazingly, the majority
of those queries evoke satisfactory,
even revelatory, results. Google has
changed the way the world finds
things out, and enticed it to look for things
previously considered unfindable.

Not only has Google very famously become
a verb, but Silicon Valley is holding its
collective breath for the seemingly inevitable
IPO, when Google will become a synonym
for another word: wealthy. Still, even without
a market cap, the two Google guys recently
made the Forbes billionaire list.


Here they are, outlining their plans for
getting all the world's information on their
thousands of servers and delivering it to
anyone who can peck a query into a search
field. Brin, 30, the ruminative Russian-born
son of a math professor who is Google's
business visionary, won't sit down: he's
bothered by a mild injury incurred by his
hobby of gymnastics. As Brin stretches,
31-year-old Page, the guardian of Google's
secret-sauce search techniques, tells a story.

"I was researching big computer networks
the other day, networks," he says. "I
put  this  really  strange  query  into  Google, 
and  got  this  research  paper  with  the  exact 
things  I  wanted.  Which  would  have  been  a 
many-hour  process  normally.  It  took  all  of 
30  seconds.  I  gave  it  to  a  bunch  of  people  in 
the  company,  and  now  we  have  this  project. 
It's  very  likely  that  I  wouldn't  have  done 

Join  Steven  Levy  for  a  Live  Talk 

on Thursday, March 25, at noon, ET,
on  Newsweek.com  on  MSNBC 
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Keeping  happy  wanderers  happy. 

Whoever  said  the  journey  is  the  destination  hasn't  spent  time 
snorkeling in St. Martin. Using HP NonStop servers and a Java
architecture implemented by HP Services, Travelocity delivers
three times more flight and hotel options in less time than it takes
to rent swim fins. So travelers spend more time in the water,
and less time online. www.hp.com/plus_travelocity


everything  is  possible 




-Packard  Development  Company,  L.P  Travelocity  is  o  registered  trademark  of  an  affiliate  of  Sabre  Holdings  Corporate 
that at all if it had been more difficult. I
think  the  value  of  that  can  be  very  large, 
making  the  world  more  efficient." 

Exactly. Google has made such eureka
moments as common as sneezing. Who
hasn't had such a revelation on Google,
whether the discovery was an old girlfriend's
whereabouts or a cutting-edge treatment for
a rare disease? Amazing to consider that less
than a decade ago, search was a backwater,
deemed not very interesting and certainly
not very profitable. Instead, Internet companies
put their energy into developing
feature-laden "portals." Then came Larry and Sergey,
and search became the center of the Internet
universe. "Search is the ultimate killer online
app," says Bob Davis, former CEO of Lycos.
"The Internet without search is like a cruise
missile without a guidance system."


The rest of the industry has noticed.
Boy, has it noticed. To quote the numbers,
Safa Rashtchy at Piper Jaffray reports that
annual search revenues are just under $4
billion today (about a billion of that is Google's)
and will almost triple over the next
four years. But those figures don't reflect
search's real impact; those empty query
fields on search pages are the front doors to
the Internet. If you're not indexed by Google,
you pretty much don't exist. And if
you're a business with a high page rank (a
key metric that determines whether your
site will be displayed high in the results for
a given query, or buried a few hundred
mouseclicks back), you can count on a
thriving online trade. A horde of new companies
has arisen whose services focus on
performing all the tweaks and playing all
the tricks that supposedly get your Web site
listed higher on Google's results pages.
(Google constantly fine-tunes its system to
frustrate such manipulating.) If you can't
afford to hire one of those firms, buy the
latest offering in a famous series: "Search
Engine Optimization for Dummies."

So it's no surprise that all the companies
that missed out the first time around
are now gearing up for the Search Wars, a
clash that will be waged with algorithms,
measured by terabytes and scored by
click-throughs. Gunning for Google are Internet
giants, clever new start-ups and an
800-pound gorilla in Redmond, Wash. They
might not have gotten it at first, but now it
seems terribly obvious. "Search has always


Seek and Ye Shall Find

Not  all  search  engines  are  created  equally.  They  differ  in  the  number  of  sites  they  reference,  how  t..., 
"hits"  and  retrieval  speed.  An  unscientific  look  at  the  four  leaders  (AOL's  engine  is  powered  by  Google) 
MSN SEARCH | YAHOO SEARCH


Guides and opinions: Top
results offered useful buying
guides. Searching under
the Groups option
helped find buyers'
opinions about
various brands.


Nice view: News
from the city appeared
at the top,
along with a link to an official
tourist guide. An image
search provided stunning
views of the capital.


Show and tell: Very
similar to Google (right
down to the number of
sponsored links), but
harder to find user
opinions. And why
the tiny type?


It's a dream: The first result
offered 30 practical
tips for a better night's
rest ("take a warm bath,"
"eat a bedtime snack").
Why can't every search
be as simple?

Star  search:  A  search 
of  Usenet  groups  yielded 
info  on  NBA  player  Brian 
Evans  and  singer  Brian 
Evans  but  little  about  a 
friend  who  worked  on 
HBO's  "Six  Feet  Under" 


Flighty: Hotel
guides dominated,
with plenty of travel
packages. Some sponsored
sites didn't know
that we wanted to visit
Russia, not Idaho.

Nice try: Top results were
about the movie starring
Al Pacino and Robin
Williams. Will reading
about this paint-by-numbers
thriller
put you to sleep?

Say who?

Numerous people
with the same name
but nothing on my guy
until I narrowed search
with extra info, namely
"Six Feet Under."


Where  to  buy:  Good  at 
narrowing  down  places 
to  buy  the  expensive  sets, 
and  also  offered  price 
comparisons  of  selected 
models  with  direct  links 
to  vendors. 

All-in-one: Detailed info
on tourist spots with links
for shopping, entertainment.
Travel and lodging
can be booked through
a partnership with
Travelocity.


Herbal tea: In
addition to the
movie, plenty of
sponsored links suggesting
white-noise CDs
and "natural
remedies."


Haystack: 

Yahoo's  People 
Search  provided 
63  individuals  named 
Brian  Evans  in  California 
alone.  A  middle  initial 
would  have  helped. 


Personal shopper:

Sponsored results dominated,
but filters helped
prune them down. Do you
want the 42-inch model or
would you like to shop for
the giant 60-incher?

History  buff:  Links  to 
travel  portals  and  hotel 
guides.  Related  search 
options  included  historical 
information  and  news 
about  President 
Vladimir  Putin. 


Soft  pillows: 

Again,  plenty  of 
sponsored  sites  for  a 
variety  of  cures,  along  with 
some  decent  links  to  Web 
pages  devoted  to  sleep 
disorders. 

Ask  someone  else: 

Jeeves  is  supposed  to 
know  all  the  answers,  but 
evidently  common  names 
are  a  challenge  for  the 
searchmeister. 

-PETER SUCIU
been essential to people's lives,"
says Jeff Weiner of Yahoo.
"We're all trying to seek happiness:
a new car, a job, a spouse
... it's how we live."

What does Brin think of the
gathering forces? He ... stretches.
"I've seen companies obsessed
with competition, say,
with Microsoft, that keep looking
in their rearview mirror and
crash into a tree head-on because
they're so distracted," he
says. "If I had one magic bullet,
I wouldn't spend it on a competitor,
I'd spend it to make
sure we're executing as well as
we possibly can. I think we're
doing a pretty good job."

The folks at Yahoo can't disagree.
Just over a year ago those
at the archetypical Internet portal
realized that while the world
was bowing before the altar of
search, their company was little
more than an overtaxed Web
directory and two pieces of paper
licensing other people's
search technology (including
you know whose). People didn't
Yahoo anybody; they Googled.
And for the folks at Yahoo, that
could not stand.

It cost more than a billion
dollars (most of it buying technology),
but Yahoo is now making
its bid to be a Google buster.
Last month it unveiled a rebuilt
engine, which spits out results
comparable to the other guy's.
The long-term strategy is to tap
the treasure house of information
that lives elsewhere on the
busy Yahoo portal. So your
search might draw from Yahoo's
traffic reports, shopping services,
maps, financial data and
hot Britney gossip. "Search results are not
enough," says Weiner. "We're going to add
another layer."

Part of Yahoo's new technology portfolio
is Overture, a company that pioneered an
advertising practice that certain search
purists regard as blasphemy: mingling "paid
inclusions" with the results normally delivered
in response to a search query. "We never
claimed it was a better approach for doing
research on 18th-century Spain," says Ted
Meisel, who came to Yahoo as head of Overture.
"But if you're trying to buy a power
washer for your back deck, it's a pretty good
way to find what you need." Now Yahoo has
begun a Content Acquisition Program
(CAP) that establishes a controversial relationship
between its search business and
companies that want to appear in the results
pages. In exchange for a fee, companies can
provide feeds of their pages to Yahoo's search
index. Weiner says that such pages won't get
unfair consideration, but critics are questioning
whether the practice affects the integrity
of the results. (Google's ads appear
alongside its "pure" search results, and are
marked as such.)

Meanwhile, Google has innovated with
a program it calls AdSense, which places
ads on Web sites that don't belong to Google:
other businesses, nonprofit or academic
institutions and even blogs. The
effects are only gradually becoming apparent.
Software designer Tim Bray was impressed
when he signed up for AdSense ads
on his blog (he says it changed an expensive
hobby to a profitable sideline) but
worries that the pressure to expose the ads
to new users might tempt people to alter
their content to boost ad revenue.

Brewster Kahle, founder of the nonprofit
Internet Archive, is hoping that at least some
of the search world remains beyond the
forces of Mammon. After all, when 71 percent
of middle- and high-school students
use the Internet as their No. 1 research
venue, isn't it a bit disturbing that homework
is becoming a sponsored activity?
Kahle is encouraging an alternative. He provides
the infrastructure for would-be search
wizards to create their own "open source"
(noncommercial) engines. "I'd like to see a
Google a month," he says.

Competitors  are  popping  out  of  the 
woodwork  and  even  coming  back  from  the 
dead.  One  rival  is  a  rejuvenated  Ask  Jeeves, 
a  onetime  dot-com  bubble  casualty.   In 
I've seen companies so obsessed with competition that they
researcher Eric Horvitz. His alternative is
software that figures out what you might
want to ask for, depending on what you're
doing. Only Microsoft, which provides
most people's mail software, word processing
and desktop, is positioned to
launch such an approach. And the radically
revamped file-handling system planned
for the next version of Windows,
code-named Longhorn, happens to be perfectly
suited to handle complicated searches. In
short, if Microsoft pulls off its goals (easier
said than done), it can offer people a
richer version of search than Google can
deliver, even before they bother to type
their queries into a search field.

Google's CEO and chairman Eric
Schmidt, brought in by Brin and Page as
the designated adult to run the company,
is a veteran of Sun and Novell, so he
knows something about being Netscaped.
He thinks it won't happen to Google.
"Why should we assume that that's any
more likely than the 50 other scenarios
that we could come up with that don't involve
this diabolical Netscape kind of
thing? This search stuff is very hard to do,
and it's really very hard to do at the kind of
scale that Google does it at. People will
have multiple choices, and our goal is to
get as many of those choices as possible to
be Google."

SERGEY BRIN, COFOUNDER, GOOGLE

The winners will be the ones who innovate
best, because the major breakthroughs
in the field are yet to come. "Search is not a
solved problem," says Udi Manber, CEO of


Google's  highly  anticipated  stock  offering  could  make  history 

Giddy  Over  Going  Public 


Even the FBI knows now:
investors will go gaga for
Google. Earlier this
month, the Feds arrested
Shamoon Rafiq, a Dutch telecom
worker living in New York City, for
allegedly selling pre-IPO (initial
public offering) "friends and family"
shares of Google stock. The
shares, the FBI says, were totally
fake, but Rafiq, who claimed to
work as a partner for Kleiner
Perkins Caufield & Byers, a
Google venture-capital backer,
allegedly conned a broker and a
New York investment banker out
of half a million dollars. (Rafiq's
lawyer didn't return a request for
comment.) The most audacious
part is how Rafiq apparently spent
the money, disemboguing his lucre
at nightclubs, strip clubs and
swank hotels on both coasts. "The
enticement of a chance to get in
on the ground floor of a potentially
lucrative IPO ... was so great
that it ensnared even otherwise
sophisticated investors," said
Pasquale D'Amuro, an FBI assistant
director.

Wall Street would be wise to
heed those words if, and when,
real Google shares go on sale to
the public later this year. The
much-anticipated Google IPO is
sure to be the largest, loudest
stock offering in history. It was
once rumored for this month, but
now Google execs dismiss all
speculation of a timetable and say
they are focusing on building the
company. Google faces a 1934
SEC rule that requires firms with
more than 500 shareholders and
$10 million in assets to disclose
financial information to the public.
But CEO Eric Schmidt just isn't
showing his cards (or exuding any
urgency). "I'm a patient person ... I
have always believed that an IPO
for the company is a good thing,
and that it should eventually occur.
But it shouldn't affect the way
the company is run."

Wall Street analysts predict the
Google stock offering could value
the company north of $12 billion.
Google doesn't disclose earnings,
but rumors put annual revenues at
close to a billion dollars with a
gross profit margin of about 30
percent. Why are investors salivating
to buy Google stock? For a
six-year-old start-up, that growth and
profit rate is unmatched in the annals
of capitalism. There's also basic
scarcity at work: besides
Google, Yahoo, eBay and Amazon
are the only premier dot-coms left,
and they're all public and trading
at high valuations. Still, $12 billion
is a stunning price tag that, factoring
in current earnings and expenses,
anticipates more than a
decade of hypergrowth in the increasingly
competitive search industry.
"You don't pay $12 billion
to $15 billion when you don't know
what could happen after five
years," says Safa Rashtchy, an Internet
analyst with Piper Jaffray.
Search efforts by rivals like Microsoft
and Yahoo cloud Google's
future. There's also the inherent
volatility of the media world; five
years ago, AOL was king of the
online media mountain. When it
comes to manias over novel Internet
companies, investors
should always take a deep breath.

-BRAD STONE
2001, it acquired the technology and
engineering team behind the highly regarded
search technology of Teoma.

Of course, Google's biggest problem
may well be (cue soundtrack from "Jaws")
Microsoft. Bill Gates is constitutionally
unable to countenance the idea that a
cheeky Silicon Valley start-up can claim
even the mildest role as an Internet gateway.
Last autumn Gates told NEWSWEEK
that his company's complacency in search
was a grave error that would soon be
corrected. "We didn't make it as much
of a priority as we should have," he
said. "We recognized that, and we're on
the job." At the World Economic Forum
earlier this year, he was even more
frank: "[Google] kicked our butts," he said.
The last time Microsoft felt similarly
embarrassed, when it failed to notice that
the Internet was kind of going to be a big
thing, Gates started a companywide jihad
that didn't stop until his competitor was
eviscerated. Now there's even a word for
what happens when Microsoft leverages
its monopoly power to flip a rival into the
toaster: Netscaped. You wouldn't blame
its rivals for quaking in their
query fields.

Instead, Googlers claim
that this time it's the Softies
who are out of their league.
Anna Patterson, a Stanford
search wizard recruited by
both companies (she chose the
Googleplex), had the chance to
evaluate Microsoft's talent. Not
impressed. "It's a bunch of
people at the first grade," she
says. "Eight junior programmers
who don't know anything
about search."

Microsoft's answer: just
wait. "I'm more than glad to
have people underestimate
what we can do," says the VP in
charge of Microsoft's search
effort, Yusef Medhi. "You can't
remotely discount the level of
technical talent we have devoted
to this."

Though Microsoft hasn't
announced the details of its
search strategy, an outline is
taking shape. The first step involves
transforming the lackluster
search engine it currently
uses in MSN. "We're taking
our time to architect a
next-generation system that answers
people's questions, an
end-to-end system that will
leapfrog what's out there today,"
says Medhi. Subsequent
stages involve tapping into the
company's unique advantages:
the software used by
hundreds of millions of people
to run their computers and create
their documents. To Microsoft,
search will involve everything
on your own machine
and other databases to which
you have access. Gates has recently
been demoing a program
out of his research division called
Stuff I've Seen, which uses "memory landmarks"
to search through e-mails, photographs
and documents.

The  next  step  might  well  be  called 
"Stuff  I  Should  See."  It  involves  another 
process  cooked  up  by  its  think-tank  people 
called Implicit Query. "Too often, searching
means stopping what you're doing,
open a browser and type in a query," says
A9, a new search company formed
by Amazon.com that will focus on
e-commerce. "Ten years from now,
what we're doing now will look pretty
primitive."

Sergey Brin agrees. "I think
we're pretty far along compared
to 10 years ago," he
says. "At the same time,
where can you go? Certainly if
you had all the world's information
directly attached to your brain,
or an artificial brain that was
smarter than your brain, you'd be
better off. Between that and today,
there's plenty of space to cover."

Indeed, over the next few
years search will evolve in a
number of key areas, and
Google faces big competition
in all of them.

DEEP CONTENT. Searching the
Web can yield amazing results, but
they're still limited and skewed.
"What's on the Web is extremely
ephemeral," says Brewster
Kahle. "Very little of it was
written before 1995." Amazon
took a giant step to address this
with its Search Inside the Book
feature that lets people query a library
of 120,000 tomes. Despite
the pay-for-content controversy,
Yahoo's CAP is an intriguing attempt
to lure content providers not
on the public Web to submit to its
indexes. "It might take a decade
or two to put all the world's information
into Google and do
things with it," says engineering
VP Wayne Rosing. "But it's an
achievable goal."

MULTIMEDIA. Google has an Image
Search function with almost a
million pictures. Microsoft researchers
in China are going full blast to create
software that searches through pictures,
possibly identifying faces and
locations. Meanwhile, a Washington, D.C.,
start-up called Streamsage has created
breakthrough technology that searches audio
and video broadcasts by analyzing
speech. And AOL, whose search strategy is
to build features on top of Google technology,
recently bought an audio-video search
operation called SingingFish.

PERSONALIZATION. A search engine that knows you're a sports-car buff
is more likely to give you auto sites when you query the word Jaguar.
Google here is at a disadvantage compared with places like Yahoo and
Amazon, which know a lot about their customers.

Many Faces of Google

The search engine that everybody loves takes time out to
recognize special events and holidays with customized logos.

LOCALIZATION. Last week Google introduced its local search, which
produces a map when you type in a category (say "restaurants") and a
ZIP code. But again, Yahoo and MSN have loads of information about
where their users live. The breakthrough here might come in a
marriage of search engines and cell phones.

ARTIFICIAL INTELLIGENCE. "The ultimate goal is to have a computer
that has the kind of semantic knowledge that a reference librarian
has," says Google's director of technology Craig Silverstein. But
truly smart search engines are probably decades away.

Google's plan to keep up in these areas is to unleash its brain power
in two ways. First, its engineers try to whittle down a rolling list
of the Top 100 tasks, determined by Brin, Page and other top execs.
Then, as dictated by Google's self-professed "bottom-up" management
style, those wizards are permitted to spend 20 percent of their time
working on projects of their choosing. Often these ideas wind up
becoming part of the Google collection of features, as was the case
for the popular Google News. Another breakout project was Orkut, a
social-networking service designed by a young engineer named Orkut
Buyukkokten. "My dream is to connect all Internet users so they can
all relate to each other," he says.

Typical Google big-think. But skeptics are saying that Google's
increasingly varied roster of services shows that the company is
losing focus. And that its bottom-up style causes chronic
disorganization. CEO Schmidt isn't worried. "I believe the
disorganization is a feature," he says. "The culture of companies is
set early, and if you changed it, you'd lose all of the great things.
This model has worked very well for us."

The confidence is reminiscent of the mood at another Mountain View,
Calif., company in 1996: Netscape. Schmidt rejects the comparison.
"The best check on hubris," he says, "is your competitors." And now,
the Google guys have plenty of 'em.

With BRAD STONE in San Francisco

Ballot Boxes
Go  High  Tech 

From  touch  screens  to  digital  'frogs,'  technology  to 
make  voting  more  secure  is  tricky,  but  it's  coming 


BY  STEVEN  LEVY 

THE FLORIDA election debacle in 2000 brought us face to face with
some bad news: common voting technology can be untrustworthy. Many
state and local election officials were already moving toward what
they thought was the answer: sleek electronic touch-screen voting
terminals where confusion would be eliminated by confusion-free
ATM-like technology. Congress sped up the process by passing the Help
America Vote Act in 2002, which partly pays for the machines. Now the
devices, made by major election suppliers like Diebold and Sequoia,
are in 30 states (the only way to vote in Georgia and Maryland), and
will be used by about 28 percent of the country in the November
elections. But in recent months, computer scientists and security
experts have uncovered weaknesses in these gizmos. Many now claim
that it's entirely possible to hack an election—deleting electronic
votes as if they were misspellings in a word processor, or doing a
cut-and-paste from one candidate to another—without anyone's knowing
it. That's because there's no way to ensure that the choices punched
on the screen will actually be reflected in the final tally. Many
experts are concluding that touch screens, the alleged voting
technology of the future, are ... untrustworthy.

A new set of players in the election arena—computer scientists and
cryptographers—are now developing systems to let

Even Google can't think of everything. A host of start-ups are
working to fill niches and capitalize on the search boom.


BY  BRAD  STONE 

A GOOGLE SEARCH for the phrase "apple tree" draws 2.4 million
results, all tucked into an endless, impenetrable catalog of blue
links. Entrepreneur R. J. Pittman thinks that's a few too many.
"Traditional search engines don't solve the information-overload
problem," he says. His Sausalito, Calif., start-up, Groxis, is
working on a solution. Its downloadable software tool, Grokker, sits
on the desktop, plugs queries into the major search engines and uses
home-cooked algorithms to analyze the pages and organize them into
categories. Then it renders those categories on the screen in an
easy-to-parse, graphical display of circles and squares. Grokker is
available on the Net for $50 while the company tests a free,
ad-supported version. "Search on the Internet needs to graduate to
the next level," Pittman says.

Groxis isn't alone in that endeavor. Over the past few years, dozens
of start-ups have followed in the wake of the search giants like the
pilot fish that travel with sharks, hoping to feed on leftovers.
Thanks to the success of Google, the search ocean is now large enough
to support many of these smaller life forms. Securities firm Piper
Jaffray predicts global revenue from search engines will grow to $8.9
billion in 2007, up from $2.6 billion today. Though big players like
Google, Yahoo and Microsoft will take the biggest bites, there's
still plenty left for upstarts with unique technologies and fresh
approaches. "The exciting thing for me is that this industry is so
young. There's lots of innovation left to be done," says Eric Madick,
CEO of a two-year-old search-marketing firm called Industry Brains.

ENGINE

VIVISIMO: Clusters search results into meaningful categories.
eBay uses it to sort auction outcomes.

TOPIX.NET: Credit ex-Netscapers for the ability to automatically
build pages around 4,000 online news sources.

CONETEQ: A Lebanese project (to be launched later this year) will
let you search products by brand, price and location.

FEEDSTER: Allows searches of the thousands of personal Web logs
and ranks results by date.

In fact, there are so many start-ups entering the search fray these
days that to sort through them you almost need an, umm, search
engine. Eurekster, launched in January, mixes search with social
networking, where you make online connections to friends and business
associates, and delivers results based partially on what those people
found useful in their related searches. Another effort, Nutch, is an
open-source search project; programmers around the world freely
contribute to its code. One of its cool planned features is letting
searchers tinker with the parameters of the search algorithm. For
instance, they can tell the search engine to focus only on the number
of times a search keyword appears inside Web pages and to ignore
other, possibly irrelevant, factors.
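
As a rough illustration of the keyword-count setting described for Nutch, the sketch below ranks pages purely by how often a query word appears in each one. It is a minimal Python sketch and not Nutch's actual code; the function names and the sample pages are hypothetical, invented only for this example.

```python
import re
from collections import Counter


def keyword_count(text: str, keyword: str) -> int:
    """Count case-insensitive whole-word occurrences of keyword in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)[keyword.lower()]


def rank_by_keyword_count(pages: dict[str, str], keyword: str) -> list[tuple[str, int]]:
    """Order pages by keyword count alone, ignoring every other signal.

    Pages that never mention the keyword are dropped; ties are broken
    by page name so the ordering is deterministic.
    """
    scored = [(name, keyword_count(text, keyword)) for name, text in pages.items()]
    matches = [(name, n) for name, n in scored if n > 0]
    return sorted(matches, key=lambda item: (-item[1], item[0]))


# Hypothetical sample pages, invented for this illustration.
SAMPLE_PAGES = {
    "orchard": "An apple tree needs sun. Prune the apple tree in winter.",
    "pie": "Apple pie starts with a good apple, or three: apple, apple, apple.",
    "cars": "The Jaguar is a sports car.",
}

if __name__ == "__main__":
    for name, count in rank_by_keyword_count(SAMPLE_PAGES, "apple"):
        print(name, count)
```

A ranking like this is easy to reason about but also easy to game by repeating a word, which is one reason the major engines weigh other factors as well.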
One popular objective for new start-ups is tackling the "deep Web,"
the terabytes of terrain that exist in the databases of government
sites, medical firms and online stores. By some estimates, these Web
pages account for more than 90 percent of the entire Net, but the
indexing software robots of the major search engines like Google have
no access to them. Jason Weiner of Chicago-based Dipsie claims he's
cracked the problem; he says the Dipsie crawler gets to all that
hidden content by acting like a human user who is browsing through
the database one page at a time. Dipsie is set to launch later this
year. Brightplanet, in Washington, D.C., has a similar plan, and
already serves paying customers like the South Dakota government,
which uses the search technology to let the public scour state
databases. Brightplanet features another nifty innovation: it
remembers the results of each search so an Internet user can build on
past research.

Former Lycos CEO Bob Davis, now a venture capitalist, says the best
opportunities for new companies are in the area of "search
marketing." Yahoo's Overture division pioneered this business and
Google built on it, selling ads across a wide network of sites. Davis
recently invested in New York City start-up Quigo, which analyzes Web
pages much like a search engine, but uses the results to match the
pages with relevant ads. For instance, if a blogger writes about
sodas, automated Quigo software would know to stick a Coke ad on the
page. Quigo technology was initially developed in Tel Aviv.
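
The soda example can be pictured with a toy matcher. The Python sketch below is only an assumption about the general idea of matching page text to ads by shared keywords; it is not Quigo's software, and the ad inventory, keyword sets and function name are all hypothetical.

```python
import re

# Hypothetical ad inventory: ad name -> keywords an advertiser might target.
AD_KEYWORDS = {
    "Coke ad": {"soda", "sodas", "cola", "soft", "drink"},
    "Travel ad": {"vacation", "rentals", "flight", "hotel"},
}


def pick_ad(page_text, ad_keywords=AD_KEYWORDS):
    """Return the ad whose keywords overlap most with the page's words.

    Returns None when no ad shares any keyword with the page.
    """
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    best_ad, best_overlap = None, 0
    for ad, keywords in ad_keywords.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_ad, best_overlap = ad, overlap
    return best_ad


if __name__ == "__main__":
    print(pick_ad("I tried three new sodas this week and ranked each cola."))
```

A real system would presumably weigh far more than raw word overlap, but the sketch shows why a post about sodas would surface a soft-drink ad rather than a travel one.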

Another international innovator, Australian Liesl Capper, launched
her start-up, Mooter, last October; it tries to decipher the implicit
meaning of a search—whether someone is looking for election results
or vacation rentals when searching for the word "Italy," for example.
"People keep asking me, why do we need another search engine?" Her
answer: "Finding information is a basic human need. We need to keep
doing the job better, with less pain all around." Dozens of other
entrepreneurs are all swimming in the same direction.


MARCH  29,  2004    NEWSWEEK      59 


COPYRIGHT 

Suit  Seeks 
Public  Access 
For  'Orphan 
Works'  of  Art 

■  A  professor  argues  that  two 
laws  extending  protection  to 
works  long  out  of  circulation 
are  unconstitutional. 

By  Xenia  P.  Kobylarz 

Daily  Journal  Staff  Writer 

Stanford Law professor Lawrence Lessig is taking another stab at
overturning a federal law that extended copyright protections for as
long as 70 years. And his weapon is a U.S. Supreme Court ruling that
upheld the extensions in a case Lessig argued and lost more than a
year ago.


■  ORPHAN:  Experts 
Doubt  Suit's  Chances 




In a lawsuit filed Monday in U.S. District Court in San Francisco,
Lessig claims the 1998 Copyright Term Extension Act and the Berne
Convention Implementation Act of 1992, which gave force in the United
States to an international treaty extending copyright protections,
are unconstitutional.

The plaintiffs are Brewster Kahle and Richard Prelinger, San
Francisco residents who operate Internet archives containing various
kinds of artistic works in the public domain.

Lessig said the statutes have severely restricted access not just to
copyrighted works, but also to so-called orphan works — copyrighted
materials that have long been out of circulation, such as
out-of-print books or old movies that have not been restored or
preserved.

The law requires archivists and others to apply to copyright holders
for permission to use such material. But orphan works have no
registered owners, so it's generally impossible to secure permission,
Lessig said.

"Because of the indiscriminate nature of copyright today, the burden
of copyright regulation extends to work whether or not the original
author has any need for continuing protection," Lessig argued in his
26-page brief. "That unnecessary burden blocks the cultivation of our
culture and the spread of knowledge."

It is a more limited argument than Lessig attempted in Eldred v.
Ashcroft, 537 U.S. 186 (2003), in which the justices upheld the
copyright extension outlined in the federal law that sometimes is
referred to as the Sonny Bono Act.

Then, Lessig argued that in indefinitely extending copyright
protections, Congress essentially violated the free-speech rights of
people who otherwise might use material whose copyrights would have
lapsed.

The court held that "when ... Congress has not altered the
traditional contours of copyright protection, further First Amendment
scrutiny is unnecessary."

"By implication," Lessig wrote in his new brief, "when Congress does
alter 'the traditional contours of copyright protection, further
First Amendment scrutiny' should be necessary."

Congress did so by abolishing the tradition that only those who could
demonstrate a need for copyright protection by registering or
renewing a work deserved the benefits of such protection, Lessig
said.

"Our clients have a fundamental desire for everyone to have access to
our culture and our past, but because of the unconditional imposition
of the current law on any kind of copyrighted material they are
unable to do that," Lessig said.

In the past, he said, orphan works typically entered the public
domain and could be preserved by archivists such as his clients.

In 1976, Congress abolished the registration and renewal requirements
for works created on or after January 1978. The Berne Convention act
extended that principle to orphan works. The law extended the lengths
of existing copyrights, again including orphan works, to as long as
70 years.

As a result, the flow of information into the public domain has been
severely hampered, Lessig said.

"The Supreme Court said in Eldred that tradition matters, and we're
saying now that the tradition was radically changed," Lessig said.
"All we're asking is that the court void the current laws based on
the Eldred opinion and require Congress to stick to the traditional
goals of copyright laws."


Lessig  doubts  Congress  could  pass 
such  sweeping  copyright  reforms  now. 

"Under  heightened  public  scrutiny,  it's 
not  clear  such  a  law  would  pass,"  he  said. 

Other copyright experts doubted Lessig can prevail.

According to Mark Radcliffe, a partner at Gray Cary Ware &
Freidenrich, the Eldred court was pretty clear that Congress can
amend copyright law as it sees fit.

"It is definitely a different approach," Radcliffe said of Lessig's
argument. "But I think they have a pretty tough row to hoe. I don't
see how they can make an argument that abolishing the renewal
requirement in the copyright law is unconstitutional."

Said Lessig, "It all depends on how carefully the court reads our
argument and applies the Supreme Court opinion. I'm quite confident
that when we get before the judge he'd be able to look at our case
differently."

In his recently published book "Free Culture," Lessig has recounted
the mistakes he made in arguing the Eldred case. He blames himself
for raising theoretical arguments rather than discussing the
practical consequences of the case. He said he won't make the same
mistake again if he gets before the Supreme Court this time.

Still, Lessig is cautious. He also is helping to push bipartisan
legislation that would let works enter the public domain unless the
owner expressly renews the copyright.

The proposed Public Domain Enhancement Act is cosponsored by Rep.
John Doolittle, R-Rocklin, and Zoe Lofgren, D-San Jose.

"The last case took four years to get to the Supreme Court and
Congress might moot the case by passing a law that effectively
imposed a renewal requirement," Lessig said.

Ashley and the BookMobile


Interviews 

Bringing books on wheels, online, to those who need them

Interview  with  Ashley  Rindsberg,  Internet  Archive  Bookmobilist 

Ashley  Rindsberg  is  a  graduate  of  Cornell  University  where  he  earned  a 
B.A.  in  Philosophy  and  a  B.A.  in  Science  and  Technology  Studies,  focusing 
on  the  Philosophy  of  Science  and  Innovation  Theory.  In  2001,  he  began 
working with the History of Recent Science and Technology Project at
MIT's  Dibner  Institute  to  digitize  the  paper-copy  archive  at  the  Cornell 
Center  for  Materials  Science.  He  has  taught  and  tutored  writing  and  has  a 
deep  interest  in  literature.  Currently  he  is  developing  and  piloting  the 
Internet  Archive's  Internet  Bookmobile  in  San  Francisco,  USA. 

Virtual  Activism  recently  interviewed  him. 

Virtual Activism: What is the history of the Internet Archive and in
particular of the Book Mobile Project? And what is its vision?

Ashley Rindsberg: The Internet Archive was founded in 1996 by
Brewster  Kahle  who  foresaw  the  need  to  preserve  quickly 
disappearing  webpages  in  order  for  a  history  and  library  of  the 
web  to  be  available.  The  Archive  grew  rapidly  and  began  to 
preserve  other  media  like  movies,  music,  television,  and 
books.  In  2001,  the  Internet  Bookmobile  was  launched  as  a 
means  of  making  our  collection  of  digital  books  useful  to  as 
many  people  as  possible.  The  Internet  Bookmobile  then 
became  a  way  of  serving  the  public  with  inexpensive  public 
domain  books  that  could  be  printed  anywhere,  on  demand. 

VA: Who is your target audience for the book mobile?

AR: Most generally, the target audiences are communities that are underserved by mainstream
publishing. Specifically, we try to get books to children (who read in various languages),
underprivileged communities, communities of artists, and communities who want to publish their
own books. We have had a huge amount of success in developing nations, where there are a great
many people who fit into the above categories.

[Photo: the Uganda Project]

VA: Where, if at all, have you implemented the book mobile project? How do you assess its success? Can you
give us a success story?

AR:  There  are  now  Internet  Bookmobiles  in  San  Francisco,  California;  Hyderabad,  India; 
Alexandria,  Egypt;  and  most  recently,  Kampala,  Uganda.  We  determine  the  success  of  the 
Bookmobiles  by  how  many  books  they  are  able  to  put  into  the  hands  of  readers. 

Most recently in Uganda, the Internet Bookmobile began visiting schools surrounding the capital,
Kampala. The school children, most of whom had never owned a book, were delighted not only to
receive books but also to help out in the printing and binding of their books. The Bookmobile,
Uganda, now travels regularly to schools in that region.

VA: In most developing countries, there is a high illiteracy rate. Do you think your project is important for
them? Why?

AR: Absolutely. In our opinion, putting a book into the hands of a child, especially when the child
is very young, is an enormous stride towards literacy. This first book can ignite a thirst that
seems inherent in people for knowledge and self-improvement. Also, the Bookmobile allows the
process of "publishing" to occur in a single day--that is, a physical book can be scanned,
processed, and then reproduced in a matter of hours, allowing people of various cultures and
mother-tongues to read books that are immediately relevant to their lives. This, we believe, not
only creates a drive to read but also makes the process of literacy easier to grasp.

VA: Are you dealing with any copyright issues and laws? Does that affect your project implementation?

AR: Copyright does affect our mission and project implementation, though mainly because
copyright law is becoming progressively more restrictive. In the latest blow to the public domain,
the Supreme Court again extended the length of copyright, thereby slowing the rate by which
books become public domain. However, there is a vast collection of public domain works which
are free to print and reproduce. So, we print these books and avoid any run-in with copyright law.

VA:  You  had  been  to  the  Alexandria  Library  in  Egypt.  What  was  your  project  with  the  Library?  Was  it,  in  your 
view,  successful? 

AR: The project with the Library of Alexandria was the
donation and implementation of a stationary and a mobile
Bookmobile setup. The Internet Archive has looked to the
great Ancient Library of Alexandria as a conceptual
model and now looks to the new Library of Alexandria as
an important partner. Both the Archive and the Library
were eager to see the Library able to print its own books
(especially as they are scanning their own Arabic books at
an impressive rate) and to print books in the Bookmobile
for people around Egypt.
The project was a success as there is now a
Bibliotheca-Alexandrina-Bookmobile and a stationary unit
right outside of the children's section inside the Library.

VA: Do you print only in English or do you print in other languages?

AR: We print a variety of languages. The Archive has books in English, Arabic, Tamil, Telugu,
Hindi,  French,  German,  amongst  others.  Of  course,  the  majority  of  our  books  are  in  English  (or 
Hindi,  since  the  scanning  centers  are  in  India)  but  our  aim  is  for  each  nation  we're  involved  with 
to  be  actively  scanning  their  own  material.  This  way,  they  can  select  the  best  content  and  then 
we  can  serve  it  so  that  communities  around  the  world  have  access  to  the  literature  of  other 
cultures,  creating  an  ad  hoc  peer-to-peer  network  of  books. 

In  Egypt,  most  of  the  printing  is  in  Arabic.  Given  the  recent  history  and  culture  of  Uganda  and 
India,  many  of  the  books  there  are  printed  in  English,  mainly  because  English  is  what  people  want 
and  are  able  to  read. 

VA:  What  projects  are  you  planning  in  the  near  future? 

AR:  Now  that  we've  had  some  success  with  Bookmobiles  abroad,  we'd  like  to  see  another 
Bookmobile  in  the  US.  We  hope  to  find  a  natural  home  in  the  US  for  the  Bookmobile,  be  it  a 
library, school, community organization etc. We're also working on the formatting of all our books
so that anyone with a computer and an Internet connection will be able to print these books for
free.

4th Annual Search Engine Watch Awards

By Danny Sullivan, Editor & Chris Sherman, Associate Editor

The  Search  Engine  Watch  Awards  recognize  outstanding  achievements  in  web 
searching.  The  winners  for  accomplishments  during  2003  are  below: 

Outstanding  Search  Service 
Winner:  Google 

Second  Place:  AllTheWeb  &  Yahoo 
Honorable  Mention:  Ask  Jeeves 

Best  Meta  Search  Engine 
Winner:  Dogpile 

Second  Place:  Vivisimo 
Honorable  Mention:  Mamma 

Best  News  Search  Engine 
Winner:  Google  News 

Second  Place:  Yahoo  News 

Honorable  Mention:  AltaVista  News  &  Daypop 

Best  Image  Search  Engine 
Winner:  Google  Images 

Second  Place:  AltaVista  Images 

Best  Shopping  Search  Engine 
Winner:  Yahoo  Shopping 

Second  Place:  Froogle  &  Shopping.com 
Honorable  Mention:  Kelkoo,  BizRate  &  mySimon 

Best  Design 
Winner:  Google 

Second  Place:  Yahoo  &  AllTheWeb 

Most  Webmaster  Friendly  Search  Provider 
Winner:  Google 

Second  Place:  Yahoo 

Honorable  Mention:  Inktomi  &  AllTheWeb 

Best  Paid  Placement  Service 
Winner: Google AdWords

Second  Place:  Overture 

1  of  20  3/2/2004  2:13  PM 


http://searchenginewatch.com/awards/print.php/34671_3309841 


Honorable  Mention:  FindWhat,  Espotting  &  Mirago 

Best  Search  Toolbar 

Winners:  Google  &  Groowe 

Second  Place:  Alexa 

Honorable Mention: Copernic Agent

Best  Search  Feature 

Winners: Google Definitions & AllTheWeb URL Investigator

Second Place: Google Calculator & AllTheWeb Calculator
Honorable  Mention:  Google  Web  API  &  Ask  Jeeves  Dictionary  Search 

Best  Specialty  Search  Engine 

Honorable  Mention:  Internet  Archive,  Scirus  &  Google  Groups 


How  The  Winners  Were  Selected 

In  early  January  2004,  Search  Engine  Watch  members  were  invited  to  nominate 
search  engines  in  various  categories  for  the  4th  Annual  Search  Engine  Watch 
Awards.  They  could  choose  from  a  list  of  search  engines  that  Search  Engine 
Watch editors thought were good within a particular category or suggest
new  services. 

In  late  January  2004,  anyone  subscribed  to  one  of  Search  Engine  Watch's 
newsletters  was  sent  a  special  email  allowing  them  to  vote  in  the  final  round. 
Each  person  was  only  able  to  vote  once  using  a  unique  voting  URL. 

Search  Engine  Watch  editor  Danny  Sullivan  and  associate  editor  Chris  Sherman 
made  the  final  decisions  about  award  winners.  Our  selections  were  influenced 
by reader votes, though the final decision over winners isn't always the same as
the  voting.  More  details  about  how  decisions  were  made  are  described  in  each 
category  below. 

Please  note  that  in  most  categories,  people  were  allowed  to  name  both  a  winner 
and  a  second  place  choice.  In  the  summary  below,  we'll  often  refer  to  how  the 
voting  went  for  the  "winner"  of  a  category  versus  the  "second  place"  vote. 

Yes, we know, it makes things confusing. However, we have found that by
letting  people  make  two  choices,  it  is  easier  to  see  the  strength  of  some 
second-tier  services  that  might  otherwise  get  drowned  out. 


Outstanding  Search  Service 


This  category  recognizes  outstanding  performance  in  helping  internet  users 
locate  general  information  from  across  the  World  Wide  Web. 

Winner: Google

Second Place: Google Calculator & AllTheWeb Calculator

The Google Calculator is another new feature the company
rolled out this year, allowing you to add, subtract, convert
measures and do an amazing range of calculations within the
Google search box. The calculator earned 6 percent of the
vote, just behind Google Definitions. We award it second
place based on its usefulness and the votes received.


AllTheWeb also offers its own calculator, first made available in mid-2002 but
publicized more in early 2003. We felt it earned a second place alongside
Google's. It may not have the range of things that Google's can calculate and
convert, but the instructions for AllTheWeb's are much more clear about what
exactly it can and cannot do - and what it can do will still be very useful to
many people.




Honorable  Mention:  Google  Web  API  &  Ask  Jeeves 
Dictionary  Search 




The Google Web API is not a feature that a searcher would
use. Instead, it enables programmers to create special
applications that make use of Google's search results. We like
the idea of a search engine letting people come up with
creative uses for its data in this fashion. Applications have
ranged from providing crossword puzzle clues to the
popularity of movies. Though the Google API program was launched in 2002,
we felt it really became popularized last year and deserving of an honorable
mention.

Ask Jeeves Dictionary Search is also similar to Google Definitions, in that if
you enter a search for "define" followed by what you want to look up, you have
access to definitions. You can find a single definition, get access to
definitions from many dictionaries, search reference material or browse
specialty dictionaries. It provides easy access to a great set of reference
links.

Unfortunately, we don't like that the material pops up in a frame, and the feature
isn't documented at all on the site. So, it's not quite in the winner category with
Google Definitions, but it does deserve honorable mention.

Best  Specialty  Search  Engine 

We  decided  not  to  issue  first  or  second  place  awards  in  this  category,  because  of 
the  277  write-in  votes  received,  the  vast  majority  were  for  specialized  search 
engines  that  received  awards  last  year. 

The voting form did suggest reviewing last year's winners for ideas about what
specialty search engines to nominate. But instead of getting new suggestions, it
seems to have simply reinforced all the same services that were already
recognized.

Don't  get  us  wrong  ~  all  the  services  we  recognized  last  year  are  excellent,  and 
the high number of votes attests to their popularity. Do make use of them! But to
make  this  category  more  meaningful,  we'll  review  ideas  on  how  to  better 
present  a  selection  of  tools  to  place  in  front  of  voters  for  next  year. 

Here  were  the  top  choices,  all  of  which  do  earn  honorable  mentions  for  gaining 
8  percent  or  more  of  the  votes  cast.  All  others  earned  4  percent  or  less. 
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Honorable  Mention: 

Internet Archive (11 percent of vote)
Scirus  (10  percent  of  vote) 
Google  Groups  (8  percent  of  vote) 

To  learn  more  about  these  services,  see  last  year's  awards. 


General  Comments 

The  last  question  on  the  voting  form  allowed  people  to  leave  general  comments. 
Here's  a  sampling: 

•  Search  engines  are  better  than  ever.  More  useful  to  consumers  and 
citizens,  more  an  integral  part  of  our  world.  It's  a  little  unfortunate  that  so 
many  users  consider  only  Google  (although  Google  is  great),  but  that  may 
change  and  the  pressure  that  the  other  engines  keep  on  Google  by  their 
constant  improvement  makes  everyone  better  off. 

•  SE firms need a means of being able to regionalize their results better -
perhaps by incorporating some kind of area code tag in the META info. If
I'm  searching  for  info  on  Cancer  I'll  get  all  of  the  standard  type  results  but 
if  I'm  looking  for  Cancer,  NY  maybe  a  means  of  ONLY  obtaining  listings 
in  NY  would  be  preferred,  via  a  '212'  type  tag. 

•  Looking  forward  to  Yahoo's  melding  of  its  search  properties  and  use  of 
Inktomi  in  place  of  Google  results.  Dreading  MSN  search  entry  into  field 
as  they  destroy  competitors,  rather  than  compete  against  them.  Looking 
forward  to  development  of  Nutch  as  an  alternative  and  specialty  search  as 
an  outgrowth  of  the  top  three  dominant  engines.  MSN,  Yahoo,  Google 
will  dominate,  now  who  innovates? 

•  I've  been  an  online  marketing  consultant  since  1997.  IMHO  Google  has 
made  a  number  of  very  bad  strategic  decisions,  and  the  search  landscape 
is  about  to  change  dramatically.  The  old  adage  of  "don't  fix  it  unless  it's 
broke"  still  applies,  and  not  only  have  they  attempted  to  completely 
change  everything,  but  they've  now  created  an  underlying  current  of  ill 
will  and  resentment  from  many  webmasters,  SEOs  and  small  business 

By JOHN MARKOFF

PALO ALTO, Calif.

AT  the  World  Economic  Forum  in  Switzerland  last  week,  Microsoft,  the  software 
heavyweight,  and  Google,  the  scrappy  Internet  search  company,  eyed  each  other  like 
wary  prizefighters  entering  the  ring. 

Bill  Gates,  the  chairman  of  Microsoft,  stated  his  admiration  for  the  "high  level  of 
I.Q."  of  Google's  designers.  "We  took  an  approach  that  I  now  realize  was  wrong,"  he 
said  of  his  company's  earlier  decision  to  ignore  the  search  market.  But,  he  added 
pointedly,  "we  will  catch  them." 

The  four  top  Google  executives  attending  the  forum,  at  the  ski  resort  of  Davos,  were 
no  less  obsessed  with  Mr.  Gates's  every  move.  "We  had  many  opportunities  to  see 
Bill  and  Microsoft  here  in  Davos,"  Eric  E.  Schmidt,  Google's  chief  executive,  wrote 
in  an  e-mail  message  to  a  colleague  that  was  distributed  to  employees  through  an 
internal  company  mailing  list. 

Microsoft  is  intently  poring  over  Google's  portfolio  of  patents,  hunting  for  potential 
vulnerabilities,  Mr.  Schmidt  contended.  And  because  Google  is  running  its  business 
using Linux - the free open source software that has become the biggest challenger of
Windows  -  Microsoft  is  concerned  that  it  may  be  at  a  competitive  disadvantage. 
"Based  on  their  visceral  reactions  to  any  discussions  about  'open  source,' "  Mr. 
Schmidt  wrote  in  his  e-mail  message,  "they  are  obsessed  with  open  source  as  a 
business  model." 

Get  ready  for  Microsoft  vs.  Silicon  Valley,  Round  2. 

The  last  time  around,  in  the  mid-1990's,  Netscape  Communications,  another  brash, 
high-tech  start-up  from  the  Bay  Area,  commercialized  the  Web  browser,  touching  off 
the dot-com gold rush. The company told anyone who would listen that its
newfangled  software  program  would  reduce  Microsoft's  flagship  Windows  operating 
system  to  a  "slightly  buggy  set  of  device  drivers." 

As  it  turned  out,  Microsoft  -  based  in  the  Seattle  suburb  of  Redmond,  far  from  Silicon 
Valley,  the  heart  of  the  nation's  technology  industry  -  was  listening. 

Mr.  Gates,  belatedly  waking  up  to  the  threat  that  the  Internet  posed  to  his  business, 
aimed  Microsoft's  firepower  at  Netscape  and  flattened  his  rival,  which  was  later 
acquired  by  America  Online  and  is  now  a  shadow  of  its  former  self  in  an  obscure 
corner of Time Warner.

As  a  consequence,  however,  he  brought  a  federal  antitrust  lawsuit  down  upon  his 
company,  raising  the  specter  of  a  Microsoft  breakup.  In  the  end,  Microsoft  escaped 
with  little  more  than  a  requirement  that  it  operate  under  a  relatively  mild 
court-ordered  consent  decree. 

Today,  nearly  everyone  in  Silicon  Valley,  from  venture  capitalists  and  chip  engineers 
to  real  estate  agents  and  restaurateurs,  has  begun  to  ask:  Will  Google  become  the  next 
Netscape? 

Mr.  Gates,  who  for  more  than  a  decade  has  promised  -  but  not  yet  delivered  - 
"information  at  your  fingertips"  for  his  customers,  has  decided  that  the  Internet  search 
business  is  both  a  serious  threat  and  a  valuable  opportunity. 

The  co-founder  and  now  the  chief  software  architect  of  his  company,  Mr.  Gates 
readily  acknowledges  these  days  that  Microsoft  "blew  it"  in  the  market  for  Internet 
search.  Despite  his  early  grand  vision,  he  displayed  little  inclination  to  deploy 
software  that  would  improve  the  ability  of  computer  users  to  find  information  -  until 
he  saw  the  dollars  in  the  business. 

THAT  opportunity  fell  to  two  Stanford  computer  science  graduate  students,  Sergey 
Brin  and  Larry  Page,  who  disregarded  the  industry's  common  wisdom  that  search 
technology  would  become  an  inexpensive,  marginal  commodity. 

While  the  Internet's  dominant  companies  fought  one  another  over  Web  portals,  the 
promise  of  e-commerce  and  access  to  providers  like  America  Online,  Google 
developed  a  speedy  search  engine  that  soon  became  almost  a  universal  first  step  onto 
the  Internet.  It  displaced  earlier  search  engines  because  the  technology  invented  by 
Mr.  Brin  and  Mr.  Page  did  a  measurably  better  job  in  returning  results  that  satisfied 
Web  surfers'  requests. 



The  Coming  Search  Wars 


As  a  result,  Google  now  has  an  immense  number  of  users,  with  200  million  searches 
on  an  average  day.  That  gives  it  a  great  advantage  over  its  competitors,  which  are 
now  trying  to  catch  up. 

"The  system  that  has  the  most  users  benefits  the  most,"  said  Nancy  Blachman,  a 
computer  scientist  and  author  of  an  independent  guide  to  using  Google 
(www.googleguide.com).  "Microsoft  faces  a  tremendous  challenge  because  Google 
fine-tunes  its  system  by  watching  how  users  adjust  their  queries." 

But  Google  has  done  more  than  develop  a  smart  new  technology.  Unlike  many 
dot-com  flameouts  of  the  1990's,  it  has  also  figured  out  how  to  turn  it  into  a  highly 
profitable  business.  The  company  demonstrated  that  focused  ads  based  on  key  words 
related  to  Web  surfers'  search  requests  are  the  most  effective  form  of  online 
advertising. 

That  has  ignited  a  three-way  battle  among  Microsoft  and  its  two  Silicon  Valley  rivals: 
Yahoo, based in Sunnyvale, Calif., and Google, whose headquarters are nearby, in
Mountain  View.  Underscoring  the  importance  of  search  engines  to  Internet 
advertising,  Yahoo  recently  said  it  planned  to  end  its  exclusive  reliance  on  Google  for 
search  results  and  had  established  its  own  research  lab  to  try  to  cut  its  new  rival's 
lead. 

Google's  financial  success  is  clear.  In  2001,  the  company  had  virtually  no  revenue;  in 
the  past  year,  it  recorded  sales  of  almost  $1  billion  and  profits  of  about  $350  million, 
according  to  several  executives  familiar  with  the  company's  private  financial  figures. 

As  for  Microsoft,  its  executives  have  already  begun  boasting  about  sharp  revenue 
growth  from  Internet  advertising  from  its  MSN  partnership  with  Overture,  now  a 
Yahoo  division,  which  also  pioneered  Web  search  advertising.  In  its  second  fiscal 
quarter  that  ended  on  Dec.  31,  Microsoft  reported  $292  million  in  online  advertising, 
an  increase  of  47  percent  from  the  corresponding  period  a  year  earlier.  The  company 
has  said  that  its  overall  online  advertising  revenue,  which  includes  sources  beyond 
search ads, reached $1 billion in the past year.

Later  this  year,  Microsoft  is  expected  to  unveil  its  own  search  technology,  which  Mr. 
Gates  says  will  help  Microsoft  catch  up  with  Google.  Last  week,  Microsoft  released  a 
test  version  of  a  special  set  of  software  buttons  for  its  browser  designed  to  direct 
users  to  its  MSN  search  and  related  services.  For  Google,  though,  the  greater  threat  is 
that  Microsoft  will  decide  that  Internet  search,  like  the  Web  browser  before  it,  should 
be  an  integral  part  of  future  versions  of  the  Windows  operating  system. 

For the moment, though, Google's lead seems formidable. Last year, Rick Rashid, a
Microsoft  vice  president  in  charge  of  the  company's  research  division,  came  to  its 
outpost  in  Silicon  Valley  to  give  a  demonstration  of  an  experimental  Microsoft 
Research  search  engine.  Shortly  afterward,  however,  Mike  Burrows,  one  of  the 
original  pioneers  of  Internet  search  at  Digital  Equipment  who  later  helped  design 
Microsoft's  experimental  search  engine,  quietly  defected.  He  joined  Google. 

But  even  if  it  can  protect  its  technological  lead,  will  Google  still  succumb  to 
Microsoft's  marketing  muscle? 

Google  shares  the  intense  Silicon  Valley  work  ethic  that  characterized  companies  like 
Netscape.  Its  new  headquarters,  on  a  spacious  campus  once  occupied  by  SGI,  a 
computer  maker,  are  just  across  the  freeway  from  Netscape's  original  base. 

But  many  veteran  Silicon  Valley  executives  are  skeptical  about  Google's  ability  to 
hold  its  corporate  culture  together  once  it  goes  public  later  this  year.  The  initial  public 
offering,  much  anticipated,  is  expected  to  create  hundreds  of  instant  multimillionaires 
among  its  regular  employees,  but  will  leave  many  others  hired  as  contractors  without 
significant  gains.  As  a  result,  some  people  fret  that  Google  is  fostering  a  class  society 
in  its  ranks. 

So  far,  though,  the  disaffection  is  limited  largely  to  the  company's  Adwords  business, 
which  is  aimed  at  creating  and  placing  its  focused  search  advertising.  That  operation 
has  grown  rapidly  with  temporary  workers.  "The  Adwords  environment  is  brutal," 
one  Google  executive  said. 

Clearly, though, keeping its ebullient esprit de corps so robust after the I.P.O. will be
difficult,  say  those  who  have  gone  through  similar  roller-coaster  rides  in  Silicon 
Valley. 

"The  challenge  Google  faces  is  figuring  out  how  to  retain  a  high  rate  of  innovation" 
in the face of a disruptive event like the I.P.O., said a former Netscape executive,
who  also  worries  that  the  two  young  founders,  for  all  their  brilliance,  may  not  fit  well 
into  the  kind  of  management  team  needed  to  run  Google  as  a  fast-growing  public 
company. 

Although  Google  has  clear  vulnerabilities,  Microsoft  is  seen  in  Silicon  Valley  as  a 
powerful  but  not  particularly  creative  competitor.  Beyond  its  core  business  in  Office 
and  Windows,  Microsoft  has  no  major  recent  successes  to  point  to  -  but  it  has  a 
growing  list  of  disappointments.  These  include  its  Xbox  video  game  player  and 
Ultimate TV set-top box.

In  other  words,  rivals  have  fought  Microsoft  and  lived  to  tell  about  it.  "At  TiVo,  we 
managed  to  stare  down  that  $40  billion  barrel,"  said  Stewart  Alsop,  a  venture 
capitalist  who  helped  finance  the  creation  of  TiVo's  digital  video  recorder,  which 
allows  TV  viewers  to  easily  record  hours  of  video  programming  for  viewing  at  other 
times.  "We  dodged  that  particular  bullet,"  Mr.  Alsop  said,  when  Microsoft  "shut 
down  Ultimate  TV  and  got  out  of  the  business." 

Other  executives  who  compete  with  Microsoft  said  Google's  position  might  be  more 
defensible  than  Microsoft  executives  believe. 

"The  good  news  for  Google  is  that  what  they  do  has  many  branches,"  said  Rob 
Glaser,  the  chief  executive  of  RealNetworks,  which  competes  with  Microsoft  in  the 
software  for  playing  video  and  digital  audio  on  personal  computers.  "It's  not  easily 
replicable  in  one  step." 

OTHERS  say  that  even  though  the  Justice  Department  consent  decree  is  weak,  it  may 
still  be  enough  of  a  barrier  to  prohibit  Microsoft  from  making  Internet  search  an 
integral  part  of  the  operating  system  in  the  same  way  it  absorbed  the  Web  browser. 

"They  can't  undercut  Google  on  price,  and  I  don't  think  they  can  get  away  with 
integrating  search,"  said  S.  Jerrold  Kaplan,  an  industry  executive  who  competed 
against  Microsoft  while  at  Lotus,  the  spreadsheet  maker  that  is  now  part  of  I.B.M. 

As  it  prepares  its  public  offering,  Google  is  trying  to  avoid  Netscape's  fate  by 
remaining  focused  on  its  own  measures  of  customer  satisfaction.  On  computers  at 
Google  headquarters,  the  home  page  constantly  displays  a  graph  reflecting  how  well 
Google  does  on  searches,  compared  with  its  competitors.  Even  the  slightest  dip  in 
performance  creates  alarm,  a  company  executive  said. 

Google  has  also  brought  in  a  Silicon  Valley  veteran,  William  V.  Campbell,  the 
chairman  of  Intuit,  to  serve  as  a  consultant.  His  gospel  for  Googlers,  as  employees 
refer  to  themselves,  is  this:  Ignore  Microsoft's  impending  arrival  as  a  competitor  and 
focus  on  the  customer. 

Good  luck.  Microsoft  has  already  begun  a  recruitment  campaign  aimed  at 
demoralizing  Google  employees,  several  Google  executives  said.  Microsoft  recruiters 
have  been  calling  Google  employees  at  home,  urging  them  to  join  Microsoft  and 
suggesting  that  their  stock  options  will  lose  value  once  Microsoft  enters  the  search 
market  in  a  serious  way. 

"Our approach has been to seek out the best and brightest talent," said Lisa Gurry, a
lead  product  manager  at  MSN.  "Beyond  that,  I  can't  add  anything." 

Google  executives  also  say  they  believe  that  Microsoft  is  systematically  pursuing 
Web  sites  downgraded  by  Google,  which  punishes  companies  for  trying  to 
manipulate  their  rankings.  The  company  is  striking  partnerships  with  unhappy 
Google  customers. 

Microsoft  is  currently  relying  on  Overture  for  its  paid  search  listings,  Ms.  Gurry  said. 

But  Google  is  hardly  standing  still.  As  Mr.  Gates  himself  has  acknowledged,  it  has 
marshaled  a  remarkable  collection  of  technologists.  They  are  focused  both  on  keeping 
the  company's  lead  in  search  technology  and  on  developing  a  range  of  new  services. 

To  help  their  work,  Google  has  been  quietly  developing  what  industry  experts 
consider  to  be  the  world's  largest  computing  facility.  Last  spring,  Google  had  more 
than  50,000  computers  distributed  in  over  a  dozen  computer  centers  around  the 
world.  The  number  topped  100,000  by  Thanksgiving,  according  to  a  person  who  has 
detailed  knowledge  of  the  Google  computing  data  center.  The  company  is  placing  a 
significant  bet  that  Microsoft  will  be  hard  pressed  to  match  its  response  time  to  the 
ever  increasing  torrent  of  search  requests. 

Besides  the  additional  computing  firepower,  Google  has  a  wide-ranging  list  of  new 
services  that  it  will  roll  out  as  competition  with  Microsoft  and  Yahoo  dictates.  For 
example,  it  recently  introduced  Orkut,  a  social  networking  service  intended  to 
compete  with  Friendster,  Linkedin  and  others.  Still  under  wraps  is  an  electronic  mail 
service  that  will  have  an  advertising  component. 

The  company  has  also  been  pushing  hard  to  find  new  sources  of  information  to  index, 
beyond  material  that  is  already  stored  in  a  digital  form.  In  December,  it  began  an 
experiment  with  book  publishers  to  index  parts  of  books,  reviews  and  other 
bibliographic  information  for  Web  surfers. 

And  Google  has  embarked  on  an  ambitious  secret  effort  known  as  Project  Ocean, 
according  to  a  person  involved  with  the  operation.  With  the  cooperation  of  Stanford 
University,  the  company  now  plans  to  digitize  the  entire  collection  of  the  vast 
Stanford  Library  published  before  1923,  which  is  no  longer  limited  by  copyright 
restrictions.  The  project  could  add  millions  of  digitized  books  that  would  be  available 
exclusively  via  Google. 

ON the marketing side, the company is racing to build its strengths overseas. Wayne
Rosing,  vice  president  for  engineering  at  Google,  has  been  chosen  to  travel  the  world, 
weaving the company's search engine into local economies and local technologies. It
is  concentrating  initially  on  12  countries. 

Mr. Page, the Google co-founder, is even trying to persuade Mr. Schmidt (the veteran
Silicon Valley executive recruited from Novell Inc. to run Google) and others in the
company to market a phone with a built-in custom personal digital assistant intended
to let Web surfers use Google from anywhere.

For  all  of  Google's  hyperactivity,  there  is  still  a  lingering  sense  among  many  Silicon 
Valley  veterans  that  they  have  seen  this  movie  before.  The  company  may  not  have 
Netscape's  arrogance,  but  it  is  still  not  clear  that  all  of  its  clever  marketing, 
technology  and  brand  identification  can  withstand  Microsoft's  onslaught  when  it 
arrives. 

After  all,  just  as  Silicon  Valley  has  learned  from  some  of  its  errors,  so  has  Mr.  Gates. 
In  Davos,  Mr.  Gates  ruefully  acknowledged  that  Google  "kicked  our  butts," 
reminding  him  of  what  Microsoft  itself  was  like  two  decades  ago. 

"Our  strategy  was  to  do  a  good  job  on  the  80  percent  of  common  queries  and  ignore 
the  other  stuff,"  he  said.  But  "it's  the  remaining  20  percent  that  counts,"  he  added, 
"because  that's  where  the  quality  perception  is." 

He  promised  not  to  make  that  mistake  again. 

Brewster Kahle on the Internet Archive and People's
Technology 

by  Lisa  Rein 
01/22/2004 

Brewster Kahle is the founder and digital librarian for the Internet Archive (IA). He
is  also  on  the  board  of  the  Electronic  Frontier  Foundation. 

The IA started out as just that - a non-profit organization dedicated to taking
snapshots of the entire Web every six months, in order to create a searchable archive.

One  of  the  main  goals  of  the  Internet  Archive  is  to  provide  "Universal  Access  to  All 
Human  Knowledge."  It  sounds  like  a  lofty  task,  but  Brewster  is  firmly  committed  to 
it,  and  truly  believes  that  it  is  achievable.  Anyone  in  his  presence  for  five  minutes  or 
more  is  likely  to  feel  the  same  way,  because  his  enthusiasm  is  quite  contagious. 

Brewster started the IA in 1996 with his own money, which he earned from the sale
of  two  separate  Internet  search  programs:  WAIS,  which  was  bought  by  AOL,  and 
Alexa  Internet,  which  was  bought  by  Amazon.  He  has  been  spending  his  own 
money  to  keep  the  institution  going  for  the  last  six  years.  Recently,  in  the  summer  of 
2003,  he  was  fortunate  enough  to  receive  some  grants  and  corporate  sponsorship. 

Newer IA projects include creating an open source movie archive, creating a
rooftop-based  WiFi  network  across  San  Francisco,  creating  an  archive  of  the  2004 
presidential  candidates  (offering  every  candidate  unlimited  storage  and  bandwidth  to 
serve  up  video),  and  creating  a  non-profit  documentary  archive. 

Let's  Start  with  the  Internet  Archive 

Lisa  Rein:  What's  the  story  behind  the  birth  of  the  Internet  Archive?  How  did  it 
start? 

Brewster Kahle: The Internet Archive started in 1996, when the Internet had
reached  critical  mass.  By  1996,  there  was  enough  material  on  the  Internet  to  show 
that  this  thing  was  the  cornerstone  for  how  people  are  going  to  be  publishing.  It  is 
the  people's  library.  People  were  using  the  Internet  in  a  major  way  towards  making 
things  available,  as  well  as  for  finding  answers  to  things.  And,  of  course,  the 
Internet  is  quite  fleeting.  The  average  life  of  a  web  page  is  about  100  days.  So  if  you 
want  to  have  culture  you  can  count  on,  you  need  to  be  able  to  refer  to  things.  And  if 
things  change  out  from  underneath  you  all  the  time,  then  you're  in  trouble.  So  what 
traditionally  has  happened  is  that  there  are  libraries,  and  libraries  collect  up 
out-of-print  materials  and  try  to  preserve  and  make  open  access  to  materials  that 
aren't  necessarily  commercially  viable  at  the  moment.  The  Internet  Archive  is  just  a 
library.  It  just  happens  to  be  a  library  that  mostly  is  composed  of  bits. 

LR:  How  did  you  get  the  funding  for  it? 

BK:  The  funding  for  the  Internet  Archive  came  originally  from  the  success  of 
selling  a  couple  of  Internet  companies  on  the  path  towards  building  a  library.  So  the 
original  funding  was  from  me,  based  on  selling  one  company,  WAIS,  Inc.  which 
was  the  first  Internet  publishing  system,  to  America  Online.  And  then  Alexa 
Internet,  which  was  a  company  short  for  "the  Library  of  Alexandria,"  to  try  to 
catalog  the  Web.  So  all  of  these  were  trying  to  build  towards  the  library,  and  these 
companies  were  sold  to  successful  companies  and  so  that  gave  me  enough  money  to 
kick  start  the  Internet  Archive.  At  this  point,  it's  funded  by  private  foundations, 
government  grants,  and  in-kind  donations  from  corporations. 

LR:  So  AOL  bought  WAIS  and  who  bought  Alexa? 

BK:  Amazon  bought  Alexa. 

LR:  What  are  some  of  the  grants?  Didn't  you  get  some  good  grants  lately,  during  the 
past  year? 

BK:  Oh  yes,  we've  been  very  fortunate  in  this  phase  of  the  Internet  Archive's  life. 
The  Sloan  Foundation  gave  us  a  significant  grant  towards  helping  get  the  materials 
up  and  able  to  be  used  by  researchers  all  over  the  world,  and  the  Hewlett  Foundation 
also  gave  us  a  sizable  grant  to  bring  more  digital  materials  from  a  lot  of  non-profit 
institutions  to  give  them  permanent  access. 

For  instance,  a  lot  of  organizations  create  documentaries  that  maybe  are  shown  once 
or  twice,  but  they're  not  permanently  available.  But  their  general  approach  was  to 
have  things  to  be  available.  So  by  having  a  library  be  able  to  digitize  and  host  these 
materials, we hope to bring a lot of non-profit materials up and out onto the Internet
so  they  can  be  leveraged  and  used  by  people  all  over  the  world. 


Brewster  Kahle  speaking  at  the  O'Reilly  2003  Emerging  Technology  Conference  in 
Santa  Clara,  CA 

LR:  How  many  people  work  here  at  the  Internet  Archive  right  now? 

BK:  There  are  12  people  full-time  here  at  the  Internet  Archive  —  probably  20  if  you 
count,  all  told.  There  are  a  lot  of  people  that  come  through.  We've  got  a  programmer 
from  Norway  and  a  programmer  from  Iceland  here  now.  We  had  a  programmer 
from  Japan  that  sort  of  came  through  and  helped  intern  and  shared  the  technology 
that  they  know  and  also  what  we  know. 

LR:  What  would  you  tell  somebody  that  was  interested  in  participating  somehow? 
You're  always  looking  for  people  to  work  on  projects,  right? 

BK:  We're  always  looking  for  help.  People  are  helping  in  many,  many  different 
ways.  By  curating  collections.  By  keeping  good  web  sites.  By  making  sure  that  web 
sites  can  be  archived  —  is  how  thousands  of  people  are  helping.  But  people  are  also 
helping  curate  some  of  the  collections  that  are  here.  We  have  volunteers  that  are 
helping  with,  oh,  things  like  SFLan  and  some  of  the  technical  work  that  we  do.  But 
also,  we  are  growing  slowly  and  we  are  hiring  a  few  more  people  —  mostly  very 
technical. 

LR: Talk about SFLan a bit.


BK:  SFLan  is  a  wireless  project  that  is  based  around  San  Francisco.  The  idea  is  to 
experiment using the wireless network to do a rooftop network, to use
commodity wireless 802.11 WiFi stuff to hop from roof to roof to roof to provide an
alternative  to  DSL  and  cable  for  the  last  mile. 

If  we  can  make  that  both  be  open  and  have  distributed  ownership,  then  people 
would  own  the  roadways  and  they  would  basically  control  their  network,  which  is 
what  the  Internet  really  is. 

LR:  What  do  you  mean  by  "the  last  mile,"  exactly? 

BK:  Trying  to  get  the  last  piece  from  getting  from  a  central  location  where  there 
might  be  a  fiber  that  comes  to  a  city,  and  try  to  get  that  distributed  so  that  people  in 
their homes can not only get materials at video speeds, 3-5 megabits per second
(DVD-like speeds), but also act as servers to make things available to others over
the  Internet  at  high  speeds. 

These  are  some  of  the  things  that  are  very  difficult  to  do,  if  not  impossible,  with  the 
current  commercial  DSL  and  cable  providers.  And  we're  looking  to  see  how  we  can 
not  only  establish  that  baseline  of  video-ready  Internet  and  make  it  so  people  can 
serve  video  over  the  Internet,  but  then,  every  year,  make  it  better  by  a  factor  of  two. 
So  the  technology  follows  Moore's  Law  just  like  the  computer  guys  do,  as  opposed 
to  how  the  telecoms  tend  to  work,  which  is  "here's  the  same  thing,  and  you'll  buy  the 
same  thing,  and  maybe  we'll  raise  the  price  slightly  ..." 
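
To make the "better by a factor of two every year" target concrete, here is a minimal sketch of the arithmetic. It assumes a starting point of 3 megabits per second (the low end of the video-speed range mentioned above) and a clean doubling each year. The function name and the five-year horizon are illustrative assumptions, not figures from the SFLan project.

```python
def projected_bandwidth_mbps(baseline_mbps: float, years: int) -> float:
    """Return the bandwidth after `years` of doubling once per year.

    Assumption: growth is an exact factor of two per year, which is the
    stated goal in the interview, not a measured result.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return baseline_mbps * 2 ** years


if __name__ == "__main__":
    # Hypothetical baseline: 3 Mbps, the low end of "video speeds".
    for year in range(6):
        print(f"year {year}: {projected_bandwidth_mbps(3.0, year):.0f} Mbps")
```

Under these assumptions, a 3 Mbps link would reach 96 Mbps after five yearly doublings, which is the contrast being drawn with a service that stays the same while the price creeps up.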

LR:  And  keep  paying  more  for  it. 

BK:  Right. 

LR:  So  you're  looking  for  people  with  rooftops? 

BK:  We're  looking  for  people  with  rooftops.  And  especially  people  that  can  buy  a 
node.  A  node  costs  $1,000,  and  that's  a  little  Linux  box  with  a  directional  antenna. 

LR:  Is  that  a  node  right  there? 

BK:  This  is  a  node  right  here  (gestures).  So  this  is  an  SFLan  box.  This  is  a 
directional  antenna  that  points  upstream  back  to  a  node  that's  closer  to  the  Net.  This 
is  an  omni  antenna.  So  anyone  who  can  see  this  can  be  on  the  Internet  for  free. 
And this is a Linux machine that's got a CompactFlash card as its hard drive, and
two  radios.  And  you  get  a  wire  that  comes  down  into  your  house,  which  is  the  way 
that  power  is  brought  up  to  this  machine.  And  also,  you  get  bandwidth  within  your 
house  or  office. 

There  are  about  23  of  these  around  San  Francisco  on  rooftops  now,  and  we're 
actively deploying new software. Cliff Cox up in Oregon is doing a lot of the
software  development  and  also  hardware  development.  He's  actually  the  guy  that 
sells  these  things  for  $1,000.  So  Internet  Archive's  participation  is  to  help  fund  the 
project  to  get  it  kick-started,  and  to  try  to  get  some  active  roofs  up  and  running. 

LR:  How  does  the  Internet  Archive  decide  about  implementing  new  technologies? 
What's  your  philosophy  about  implementing  new  technologies? 

BK:  The  Internet  Archive  is  extremely  pragmatic  about  new  technologies.  What  we 
tend  to  do  is  look  at  the  least  costly,  both  in  the  short  term  and  long  term.  So  we  are 
frugal  to  the  core. 

We  run  currently  about  700  computers.  They're  all  running  Linux.  We  don't  have 
any  dedicated  routers.  We  just  use  Linux  machines.  We  use  the  same  Linux  machine 
over and over and over and over and over again. Jim Gray's model - he calls it the
"brick  model."  So  we  just  use  Linux  machines  stacked  up,  and  even  though  they 
might  be  storage  machines,  or  CPU  machines,  or  running  as  a  router,  or  running  as  a 
load  balancer,  or  a  database  machine  —  they're  all  just  the  same  machine.  What 
we've  found  is  that  it  allows  us  to  only  have  one  or  maybe  just  two  systems 
administrators  being  able  to  scale  to  many  hundreds  and,  we  hope,  a  few  thousand, 
machines,  by  having  such  a  simple  underlying  hardware  architecture. 

Because  we  operate  on  these  machines  stacked  up,  we  tend  to  do  everything  based 
on  clusters.  Because  our  amounts  of  data  are  fairly  large.  We  have,  oh,  several 
hundred terabytes at this point - three, four hundred terabytes of materials, and it's
growing a lot. So it's difficult to process these if you have to go through just one
machine,  and  a  lot  of  proprietary  software  is  licensed  to  just  be  on  one  machine,  or  it 
costs  per  each. 

Open  source  has  the  ability  that  you  can  go  and  run  it  on  as  many  machines  as  you 
want.  Because  we  run  things  and  we  do  data  processing  and  conversions  on  ten 
machines  or  a  hundred  machines  at  once,  we  find  that  open  source  is  often  the  most 
pragmatic,  least  costly  way  to  roll.  We  also  find  that  it's  easiest  for  other  people  to 
copy  our  model  if  we  use  open  source  software,  so  we  tend  towards  using  open 
source software, because we'd like anything that we develop to be actively used by
others readily and easily.

LR:  How  much  do  you  test  before  going  live  with  new  services  and  things?  Do  you 
do  a  lot  of  testing? 

BK:  Do  we  do  a  lot  of  testing?  I'd  say  we  do  a  lot  of  progressive  rollouts.  We  do 
testing  in-house,  but  you  can  only  go  so  far,  and  then  you  bring  on  some  number  of 
your  users  and  bring  things  out.  I'd  say  we're  less  testing-oriented.  We're  less 
service-quality  oriented  than  a  lot  of  places,  because  we're  researching.  We're  trying 
to  push  the  edge.  So  we  try  to  make  sure  our  data  is  safe,  but  if  there  happens  to  be  a 
hiccup,  we  are  very  public  about  that,  and  we're  looking  for  help  from  others  to  help 
us  resolve  these  and  find  them.  So  I'd  say  we're  not  like  a  commercial  company 
doing  lots  of  in-house  testing  and  rounds  and  rounds  of  beta  testing,  because  we 
only  have  12  people  to  run  all  of  this. 
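
One common way to stage a progressive rollout of the kind described above is to give each user a stable bucket number and switch the new service on for a growing share of buckets. The sketch below is an assumption about how that could look, not a description of the Archive's own tooling; `in_rollout`, the feature name, and the user IDs are all hypothetical.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Return True if this user falls inside the first `percent` buckets.

    The bucket (0-99) is derived from a hash of the feature name and the
    user ID, so a given user always lands in the same bucket for a feature.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


# Hypothetical usage: widen the audience from 10 percent to 50 percent.
early_users = [u for u in ("alice", "bob", "carol") if in_rollout(u, "new-search", 10)]
wider_users = [u for u in ("alice", "bob", "carol") if in_rollout(u, "new-search", 50)]
```

Because the bucket is stable, anyone included at 10 percent is still included at 50 percent, so early users do not lose the service as the rollout widens.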

LR:  Can  you  remember  a  specific  situation  where  the  technology  could  have  gone 
one  way  or  the  other,  and  you  decided  on  a  certain  way  over  another  way,  and  why? 
When  there's  a  fork  in  the  road,  what  process  do  you  go  through  to  decide  which 
way  to  go? 

BK:  Boy,  when  there're  different  choices  of  which  way  to  go,  you  find  that  one  of 
the  lead  motivators  in  terms  of  how  we  decide  which  way  to  go  is  which  way  people 
believe  it  should  go.  People  are  always  open  to  testing  and  pushing  back  and  saying, 
"Why  do  you  think  that's  true?"  Especially  if  we've  tried  going  down  that  road 
before. 

Let's  take  RAID  -  Redundant  Arrays  of  Independent  Disks.  The  idea  is  to  run,  say, 
four  disks  or  eight  disks  as  a  cluster  of  disks  so  that  if  one  fails,  it  has  the 
information  on  the  other  ones,  so  that  it  doesn't  fail,  so  you  can  replace  the  disk  and 
be  able  to  keep  going.  Every  few  years  we  think  that  this  is  the  right  thing  to  do,  and 
every  few  years,  we  find,  unfortunately,  that  it  is  the  wrong  thing  to  do. 

But  it  doesn't  seem  to  keep  us  from  trying  again.  Every  so  often  we  think,  "Okay, 
they  must  have  fixed  the  bugs,"  and  that  the  software  must  be  more  reliable,  or  the 
controllers  must  be  more  reliable,  and  we'll  go  and  put  some  number  of  machines 
into  this  new  structure  and  then  watch  them  for  six  months  to  a  year  to  sort  of  see, 
"Does  it  work  better  or  worse  than  what  we  were  using  before?"  With  RAID,  we've 
found  with  two  major  tests  of  RAID  that  it's  been  a  loser. 
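
The RAID idea outlined above can be illustrated with a minimal sketch. It assumes single-parity XOR striping (in the style of RAID 5); the interview does not say which RAID level or controller the Archive tested, and the function names here are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Reconstruct the block at lost_index from all the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


# Three "disks" of data plus one parity "disk" (illustrative contents).
stripe = make_stripe([b"arch", b"ive.", b"org!"])
recovered = rebuild(stripe, 1)  # pretend the second disk failed
```

A single parity block can stand in for only one missing block per stripe; with two blocks missing, this scheme cannot reconstruct either of them.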

LR: Why? What goes wrong?

BK: We're not exactly sure, but it looks like the RAID controllers are just not
debugged  very  well.  The  software  isn't  debugged.  The  hardware  isn't  debugged. 
There  are  failure  modes  that  fall  outside  of  there.  "Oh,"  (supposedly)  "if  one  disk 
just  goes  completely  corrupt,  then  you  can  replace  it  and  everything's  fine."  Well, 
we've  found  out  in  the  latest  Linux  release  that  if  two  disks  just  hiccup  slightly,  then 
it  gives  it  up  for  lost  and  it  says,  "You  lose  all  your  data,"  and  so  we've  had  to  spend 
months  then  going  back  and  decrypting  all  of  the  Linux  RAID  controller  file  system 
to  be  able  to  recover  all  of  the  data  that  you  can  actually  recover.  So  I  think  it's  just 
bad  implementations  based  on  not  being  able  to  get  the  reliability  up,  based  on  not 
having  enough  test  cases. 

We  go  along  with  Hillis'  Law.  Danny  Hillis  was  one  of  the  great  computer  designers 
of  all  time,  and  his  approach  was  to  have  large  numbers  of  commodity  components; 
that  basically,  price  follows  volume.  So  if  things  are  made  in  more  volume,  the  price 
is lower. You can say, "Duh. Obviously." But it's amazing that most people don't
follow  this.  Particularly  that  the  price  goes  down  when  there's  more  of  it  made.  You 
want  to  use  things  that  cost  less,  because  you  might  get  more  gigabytes  per  hard 
drive  if  you're  using  commodity  components,  as  opposed  to  specialty  components. 

But  another  corollary  of  this  is  that  "reliability  follows  volume."  That  things  that  are 
made  in  large  volume  have  to  be  more  reliable,  at  least  in  the  long  haul,  otherwise 
the  company  that's  making  them  would  go  out  of  business  because  they'd  have  too 
many  failures.  Another  way  of  saying  that  is  that  Toyotas  are  more  reliable  than 
Ferraris.  Even  though  a  Toyota  might  cost  one-tenth  as  much  as  a  Ferrari,  they  are 
probably  on  the  road  more  often.  The  coupling  of  this  is  that  if  you  want  a  reliable 
system,  and  you  want  one  that  doesn't  cost  that  much,  go  for  high  volume,  if  you 
want  it  available,  reliable,  etc.  And  so  we  find  that  technologies  that  are  commodity 
and  made  in  high  volumes  work  better. 


LR:  When  you  say  "commodity,"  you  mean 
"off  the  shelf,"  or  COTS  products,  right? 

BK:  Yes. 

LR:  Let's  talk  a  little  bit  about  your  philosophy 
now.  Could  you  discuss  what  you  mean  when 
you  talk  about  "Universal  Access  To  All 
Human  Knowledge?" 

BK: "Universal Access To All Human Knowledge" is a motto of Raj Reddy from
Carnegie Mellon. I found that if you really actually come to understand that
statement, then that statement is possible; technologically possible to take, say,
all published materials - all books, music, video, software, web sites - that it's
actually possible to have universal access to all of that. Some for a fee, and some
for free. I found that was a life-changing event for me. That is just an inspiring
goal. It's the dream of the Greeks, which they embodied, with the Egyptians, in the
Library of Alexandria. The idea of having all knowledge accessible.

[Sidebar fragment: ... wireless technologies (mainly 802.11) in the developing
world. We are planning a Wireless Roadshow to ... bring internet and intranet
connectivity to those parts of the world not included in the plans of the
commercial telecommunications companies. O'Reilly Emerging Technology
Conference ... San Diego, CA]


But,  of  course,  in  the  Library  of  Alexandria's  case,  you  had  to  actually  go  to 
Alexandria.  They  didn't  have  the  Internet.  Well,  fortunately,  we  not  only  have  the 
storage  technology  to  be  able  to  store  all  of  these  materials  cost-effectively,  but  we 
can  make  it  universally  available.  So  that's  been  just  a  fabulous  goal  that  causes  me 
to  spring  out  of  bed  in  the  morning. 

And  it  also  —  when  other  people  sort  of  catch  on  to  this  idea  that  we  could  actually 
do  this  —  that  it  helps  straighten  the  path.  You  know,  life,  there're  lots  of  paths  that 
sort  of  wander  around.  But  I  find  that  having  a  goal  that's  that  far  out,  but  also 
doable,  it  helps  me  keep  my  direction,  keep  our  organization's  direction.  And  I'm 
finding  that  a  lot  of  other  people  like  that  direction,  as  well. 

LR:  Do  you  have  an  overall  philosophy  about  technology  and  the  direction  in  which 
you'd  like  to  see  it  go? 

BK:  I  don't  really  have  a  philosophy  about  technology.  I  have  a  philosophy  of  what 
future  I  want  to  live  in,  which  is  probably  more  of  a  social  and  cultural  issue  than  it 
really is a technological issue. And socially and culturally, what I want to grow up in
- and have my kids grow up in - is a wonderful flowering of all sorts of really wild
ideas  coming  from  all  sorts  of  people  doing  diverse  and  interesting  things. 

What  I'd  really  like  to  see  is  a  world  where  there's  no  limitations  on  getting  your 
creative  ideas  out  there.  That  people  have  a  platform  to  find  their  natural  audience. 
Whether  their  natural  audience  is  one  person,  themselves,  or  a  hundred  people,  or  a 
thousand  people.  Try  to  make  it  so  the  technologies  that  we  develop,  and  the 
institutions  we  develop,  make  it  so  that  people  have  an  opportunity  to  flower.  To 
live a satisfying life by providing things to others that they appreciate.

And I think our technologies right now are well-suited to doing this in the
information  domain.  In  the  information  domain,  we  can  go  and  offer  people  an 
ability  to  publish  without  the  traditional  restrictions  that  came  before,  and  to  help, 
with  these  search  engine  technologies,  to  help  them  find  their  natural  audiences. 
And  so  people  out  there  aren't  surrounded  by  stuff  they  don't  want.  That  they  find 
that  the  music  recordings  they  want  and  the  video  recordings  they  want,  even  though 
they're  made  a  half  a  continent  away,  and  there  are  only  a  hundred  other  people  that 
also  really  like  that  genre. 

LR:  What  kind  of  projects  are  you  working  on  with  the  Library  of  Congress? 

BK:  We've  been  working  with  the  Library  of  Congress  over  the  last  three  or  four 
years  to  help  archive  web  sites.  They've  got  a  mission  to  record  the  cultural  heritage 
of  the  United  States  —  actually  also,  Thomas  Jefferson  gave  them,  more  broadly, 
"the  world."  And  now  that  publishing  is  moving,  or  a  large  section  of  publishing,  is 
moving  on  to  the  Internet,  we've  been  working  with  them  as  a  technology  partner. 
They  do  the  curation,  and  we  do  some  special  crawls. 

Our  first  project  with  them  was  the  election  in  the  year  2000.  The  presidential 
election.  And  they  selected  a  set  of  web  sites,  and  we  crawled  them  every  day  to  try 
to  get  a  historical  record,  and  then  the  Internet  Archive  made  them  available  to  the 
world  to  see  and  use,  to  see  if  it  was  useful  to  people. 

The  Library  of  Congress  is  trying  to  move  into  the  digital  realm,  and  they  just  got  a 
hundred  million  dollars  from  Congress  to  help  do  digital  preservation,  and  we  hope 
to  be  participants  as  that  unfolds.  We'll  see.  But  the  Library  of  Congress  has  got  a 
lot of money - a 450-to-500-million-dollars-a-year budget. We hope that a growing
percentage  of  that  goes  towards  digital  materials,  whether  working  with  us  or  others, 
than  currently,  which  is  I  think  probably  less  than  one  percent. 

LR:  Earlier  you  said  that  one  way  that  people  could  help  was  to  make  their  web 
sites  "more  archivable,"  basically.  What  does  that  really  mean?  How  would  you 
make  your  web  site  easily  archivable? 

BK:  Boy.  By  being  straightforward.  I  think  by  keeping  things  fairly  simple.  If  web 
sites  have  sort  of  straightforward  links,  then  that  makes  things  a  lot  easier. 

LR:  What  do  you  mean  "straightforward?" 

BK:  Straightforward  URLs.  JavaScript  that's  fairly  clear-cut  or  reused  from  other 
places. What we have been really stumped on is sites that need a lot of JavaScript or
a lot of programs that are needed to even render the site at all.

Probably  one  way  of  finding  out  is  going  to  archive.org  and  seeing,  "Did  we  get  it 
right?"  the  last  time.  We're  continuously  updating  our  tools  and  trying  to  make 
things  better.  But  for  instance,  we've  been  having  trouble  with  .swf  files, 
Shockwave  and  Flash  files,  from  Macromedia.  If  those  files  have  links  to  other 
pages  inside  of  them,  we're  just  not  able  to  find  those  links,  so  we  can't  follow  them. 
We  also  have  trouble  rewriting  those  .swf  files  so  that  they  point  to  the  Archive's 
version  of  the  links  and  not  the  live  Web's.  So  we're  having  trouble  with  certain 
complicated  web  sites.  What  we'd  like  to  see  is  more  straightforward  use  of 
pointers,  because  the  hyperlink  is  one  of  the  great  ideas  of  the  Internet. 
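
To make the point about straightforward links concrete, here is a minimal sketch of how a simple crawler might collect links from a page. It is an illustration only, not the Archive's crawler; the class name, the sample page, and the example.org URLs are hypothetical. It uses Python's standard `html.parser` module and looks only at plain `<a href>` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href targets from plain <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def find_links(html, base_url):
    """Return the absolute URLs of every plain hyperlink in the page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# A hypothetical page: one plain link, two destinations reachable only via script.
page = """
<a href="/about.html">About</a>
<span onclick="location.href='/hidden.html'">Hidden</span>
<script>var next = "/scripted.html";</script>
"""
links = find_links(page, "http://example.org/index.html")
```

Only the plain hyperlink is found. The two destinations that exist solely inside script code are invisible to a parser like this one, which is the kind of difficulty described above for script-heavy sites and for links buried inside .swf files.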

Lisa  Rein  is  a  co-founder  of  Creative  Commons,  a  video  blogger  at  On  Lisa  Rein's 
Radar, and a singer-songwriter-musician at lisarein.com.


Brewster Kahle wants to network San Francisco. All of it.

"Basically,  what  we  want  is  the  expectation  that  if  you  open  your  laptop 
or  personal  digital  assistant  (PDA)  at  any  given  place,  you  have  access 
to  the  open  Internet,"  he  says.  And,  for  Kahle,  a  slow  network  isn't 
going  to  cut  it.  He  wants  to  move  data  at  the  speed  of  DVD  video. 
"Search,  click,  see  movies,"  he  emphasizes.  "Now,  we're  not  quite  there 
yet.  But  that's  the  idea." 

It  sounds  like  every  broadband  Internet  service  provider's  fantasy,  but 
it's  maybe  not  as  far-fetched  as  it  seems.  Over  the  last  year  or  so,  small, 
gray  plastic  boxes  have  begun  appearing  atop  homes  and  businesses 
around  San  Francisco.  Roof  by  roof,  they're  bringing  Kahle's  vision  of 
ubiquitous wireless-network access closer to reality - no telephone
companies  or  cable  providers  required. 

The  boxes  are  the  work  of  two  local  research  ventures,  each  with  the 
goal  of  improving  understanding  of,  and  applications  for,  wireless 
networking.  One  is  SFLan,  a  project  of  the  Internet  Archive,  the 
nonprofit  organization  Kahle  created  in  1996.  The  other  is  the  Bay  Area 
Research  Wireless  Network  (BARWN),  founded  by  networking 
consultant  Tim  Pozar  and  partner  Matt  Peterson. 

Pozar  and  Peterson  designed  the  hardware  shared  by  both  projects  as  a 
means  to  deploy  a  new  kind  of  wireless  network.  Inside  each  gray  box 
are typical components for a wireless access point: off-the-shelf 802.11
network cards powered by a small, single-board computer. But, rather
than  creating  traditional  wireless  hot  spots  like  those  found  in  hotels  or 
coffee  shops,  Pozar  and  Peterson's  boxes  are  engineered  specifically  for 
long-distance,  outdoor  applications. 

Each  box  incorporates  two  high-gain  antennas.  The  first  is 
omnidirectional,  acting  as  a  beacon  of  wireless  connectivity  to  the 
immediate  area.  The  second  points  straight  at  the  nearest  neighboring 
rooftop box, or perhaps to the locus of BARWN's most ambitious
project to date: a powerful antenna high atop San Bruno Mountain. In
this  way,  each  box  is  linked  to  a  larger  network,  forming  a  completely 
wireless  backbone  extending  across  the  City  and  to  neighboring 
municipalities. 

If  you're  close  enough  to  a  rooftop  box  (say,  within  a  thousand  feet),  a 
laptop  or  PDA  may  be  all  you  need  to  join  the  SFLan  network.  More 
likely,  however,  you'll  need  additional  amplifiers  or  after-market 
antennas.  The  high-gain  directional  antennas  in  the  gray  boxes 
themselves,  for  example,  can  see  each  other  from  2  miles  away,  or  pull 
down  a  signal  from  the  San  Bruno  Mountain  site  at  a  distance  of  up  to  8 
miles.  Once  the  connection  is  made,  you  have  an  effective  replacement 
for  traditional  broadband. 

"You're  still  going  to  need  something  that  repeats  the  signal  inside  your 
house,"  explains  Kahle.  "You'd  put  one  of  these  [boxes]  up  on  your 
roof  as  a  replacement  for  DSL  or  cable,  and  then  you'd  bring  a  wire 
down  into  your  house  to  either  plug  in  to  your  own  router  or  hub  or  to 
connect  to  your  computer."  At  the  same  time,  the  box  would 
rebroadcast  the  signal  to  your  local  neighborhood  while  paving  the  way 
for  the  next  rooftop  node  to  join  the  network,  forming  a  daisy  chain. 
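
The daisy chain can be pictured as a graph search: each box links to any other box within range, and a rooftop is on the network if some chain of hops leads back to a connected box. The sketch below is illustrative only. It assumes the 2-mile box-to-box figure quoted above, treats the city as a flat plane measured in miles, and ignores line of sight; the box names and coordinates are hypothetical.

```python
from collections import deque
from math import dist

LINK_RANGE_MILES = 2.0  # box-to-box distance quoted in the article


def reachable(boxes, gateway):
    """Return {box name: hop count} for every box reachable from the gateway."""
    hops = {gateway: 0}
    queue = deque([gateway])
    while queue:
        current = queue.popleft()
        for name, position in boxes.items():
            in_range = dist(boxes[current], position) <= LINK_RANGE_MILES
            if name not in hops and in_range:
                hops[name] = hops[current] + 1
                queue.append(name)
    return hops


# Hypothetical rooftop positions, as (x, y) in miles.
boxes = {
    "gateway": (0.0, 0.0),
    "roof_a": (1.5, 0.0),
    "roof_b": (3.0, 0.0),
    "roof_c": (3.0, 5.0),
}
hops = reachable(boxes, "gateway")
```

In this made-up layout, roof_b is too far from the gateway to link directly but joins through roof_a, while roof_c stays off the network until another box goes up between it and the rest.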

So  far,  25  of  the  gray  boxes  are  up  and  running  across  San  Francisco, 
and  more  are  on  the  way.  At  about  $1,000  apiece,  they're  not  exactly 
cheap.  But  compared  to  earlier  generations  of  long-distance 
wireless-networking  hardware,  which  sometimes  cost  tens  of  thousands 
of  dollars  per  site,  Pozar's  design  is  a  steal  -  particularly  for 
underdeveloped  areas  where  there  may  be  no  other  solution. 

Take, for example, Hunters Point, an area long overlooked by
traditional broadband providers. "It's kind of a catch-22," Pozar says.
"Since they can't afford [DSL], Pac Bell's not going to bring it in. They
just can't justify going out and loading up the equipment at their central
offices.  So  a  lot  of  these  places  don't  even  have  DSL  deployed." 

Pozar  is  in  talks  with  co-location  provider  eXchange@200  Paul  to  put 
one  of  BARWN's  gray  boxes  on  the  roof  of  its  facility,  bringing 
broadband  networking  to  the  Hunters  Point  area  for  the  first  time, 
wirelessly.  And  what  works  for  Hunters  Point  could  be  even  more 
beneficial  in  developing  countries. 

In  2001  and  2002,  for  example,  Clif  Cox  used  wireless  technology  to 
bring  telephone  and  data  connectivity  to  the  tiny  Himalayan  kingdom  of 
Bhutan. Today, Cox - the primary architect of SFLan - operates a
small  business  out  of  his  workshop  in  Eugene,  Ore.,  building  gray 
boxes  based  on  Peterson  and  Pozar's  design.  Besides  supplying  both 
SFLan  and  BARWN,  he's  shipped  similar  hardware  to  clients  in  locales 
as  far-flung  as  the  Galapagos  Islands. 

For  these  areas,  Internet  access  is  often  less  important  than  basic 
communication.  "In  Bhutan,  the  wireless  network  was  primarily  to 
provide  telephone  service,"  explains  Cox.  "If  you  build  a  digital 
network,  then  it's  like  falling  off  a  log  to  send  audio  data  over  it."  As  an 
added  bonus,  such  a  network  allows  for  Internet  access  at  no 
incremental  cost,  when  the  time  comes. 

There's  a  long  list  of  other  roles  wireless  technology  can  play  in  both 
developed  and  developing  communities.  Improving  communications  for 
public-safety agencies is one - for example, providing
multimedia-capable  data  links  for  police  and  fire  department  mobile 
command  centers.  Reducing  barriers  to  education  is  another.  The 
National  Science  Foundation,  for  example,  is  already  exploring  using 
wireless  links  to  facilitate  distance  learning  on  American  Indian 
reservations  in  North  Dakota,  where  school-age  children  may  live  as  far 
as  a  hundred  miles  from  the  nearest  classes. 

Laying  the  groundwork  for  these  types  of  applications  is  an  important 
research  goal  of  SFLan  and  BARWN.  But  many  of  the  participants  in 
the  two  projects  are  equally  motivated  by  something  else:  an  idea 
known  as  community  networking. 

"If  I  have  data  or  resources  in  my  house  and  somebody  lives  50  feet 
away from me in their house and wants to be able to share data back
and forth, it seems pretty silly that we both have to pay $50 a month for
a  DSL  connection  to  be  able  to  do  that,"  says  Pozar.  "We  should  just  be 
able  to  throw  this  virtual  wire  over  our  fence  and  be  able  to  send  data 
back  and  forth." 

That's  community  networking:  two  people,  each  with  their  own 
resources,  connecting  them  together.  As  a  model,  it  resembles  a  co-op, 
and  in  many  ways  it  closely  mirrors  the  early  roots  of  the  Internet  itself. 

As  Cox  explains  it,  "The  community-network  folks  are  trying  to  build 
the  commons  and  make  something  that  they  feel  should  be  kind  of  like 
a public library - like, free for everyone, just there, like infrastructure.
Kind  of  like  drinking  fountains  on  the  side  of  the  street.  You  don't  pay 
for  that  water.  It  should  be  like  that.  It's  access  to  information,  it's 
access  to  quality  of  life.  It's  just  participating  in  society.  That's  what 
wireless  technology  is  starting  to  be  associated  with." 

There's  still  a  long  way  to  go  before  ubiquitous  free  wireless  becomes  a 
reality  in  San  Francisco.  But  at  25  gray  boxes  and  counting,  SFLan  and 
BARWN  are  here  today.  And  they're  off  to  a  good  start,  as  the  roughly 
1,000  users  who  have  already  logged  on  to  the  network  anonymously 
can  attest. 

As  to  the  future,  "we  would  just  like  to  see  this  type  of  idea  and 
technology  copied,  and  matured,  by  many  others,"  says  Kahle.  "We 
don't  see  it  as  a  large,  centralized  system.  It's  an  idea  toward  making 
community  networks  operate  at  very  high  speeds  that  have  distributed 
ownership.  We'd  just  basically  like  to  see  bandwidth  spread  like  a  virus 
...  so  it  just  builds  on  itself." 

How  about  that?  A  computer  virus  you  might  actually  want  to  catch  - 
spreading soon, to a building, a street corner or a park near you.

Profile: Book binding technology that can make professional quality
books  at  a  low  cost  within  minutes 

January  12,  2004 

RENEE MONTAGNE, host: This is MORNING EDITION from NPR News. I'm Renee Montagne.

Imagine  a  day  when  there's  no  such  thing  as  an  out-of-print  book.  That  day  could  come  soon 
as  a  result  of  an  invention  by  a  California  entrepreneur.  It's  a  book  binding  technology  that 
can  make  professional  quality  books  at  a  low  cost  within  minutes.  NPR's  Laura  Sydell  reports. 

LAURA  SYDELL  reporting: 

A  year  ago,  Kevin  Parker  had  a  breakthrough.  The  manufacturer  of  book  binding  equipment 
developed  a  machine  about  the  size  of  a  desktop  printer  and  a  glue  strip  that  binds 
professional  quality  hard  and  softcover  books  within  minutes.  He  says  the  machine,  which 
costs  as  little  as  $1,300,  will  make  it  possible  for  all  kinds  of  people  to  publish. 

Mr. KEVIN PARKER: Could be a book that was in production, but it's out of print, and they
need  to  just  make  now  an  additional  copy.  Or  it  could  be  somebody  starting  out  that  may  have 
a  great  book  but  can't  get  a  publisher. 

SYDELL:  Industry  observers  say  Parker's  invention  could  change  book  publishing.  Bill  Sullivan, 
president  of  Digital  Binding  Solutions,  is  a  longtime  industry  consultant. 

Mr.  BILL  SULLIVAN  (President,  Digital  Binding  Solutions):  Equipment  like  Kevin's  and  the 
evolution  of  Kevin's  will  now  make  it  possible  for  you  to  go  to  Borders  and  select  a  title  from  a 
file  and  have  a  clerk  reproduce  the  book  for  you  in  several  minutes,  and  you'll  walk  away  with 
a  title  that  you  couldn't  have  purchased  any  other  way  conveniently. 

SYDELL:  Sullivan  says  it's  likely  to  take  several  years  before  most  consumers  see  this  set  up, 
but  Parker  can  wait  a  little  longer.  This  may  sound  strange,  but  he's  been  obsessed  with 
making  the  perfect  glue  strip  for  22  years. 

(Soundbite  of  machinery) 

Mr.  PARKER:  So  this  is  the  heart  of  the  company.  That  sound  or  the  clicking  is  the  blood 
flowing. 

SYDELL:  That  is  actually  the  sound  of  machines  manufacturing  glue  strips  on  the  floor  of  Powis 
Parker,  Kevin  Parker's  Berkeley,  California-based  company.  The  42-year-old  Parker  dropped 
out  of  University  of  California,  Berkeley  to  help  a  friend  with  a  small  office  supply  business 
work  on  developing  a  strip  for  his  Xerox  binding  machine  because  pages  kept  falling  out  of  the 
ones  they  had.  Soon,  the  project  took  over  his  life  and  his  parents'  house. 

Mr. PARKER: We filled up the basement, and there was this activity down there at all hours.
Then the yard started to be populated with Army tents full of scrap machines and supplies.

SYDELL:  Parker's  parents  moved  out  and  sold  him  the  house  and  he  never  went  back  to 
college.  His  company  now  has  separate  headquarters,  employs  a  hundred  people  and  brings  in 
about  $18  million  in  revenues  through  the  sale  of  book  binding  supplies  and  machines. 
Among  the  fans  of  his  greatest  breakthrough  is  Brewster  Kahle,  founder  of  the  Internet  Archive 
in  San  Francisco.  Kahle  is  using  the  Powis  Parker  Fastback  binding  machine  to  bring  books  to 
children  around  the  world. 

(Soundbite  of  generator) 

SYDELL:  That's  Kahle  revving  up  the  generator  inside  his  Bookmobile  as  he  explains  what's 
inside. 

Mr.  BREWSTER  KAHLE  (Founder,  Internet  Archive):  Anywhere  that  we  can  see  the  sky,  we  can 
make  books  because  we've  got  this  little  family  van  here  that's  got  a  satellite  dish,  a 
computer,  a  printer  and  a  book  binder. 

SYDELL:  Using  a  grant  from  the  World  Bank,  Kahle  sent  a  van  like  this  to  Uganda.  The  van 
visits  villages.  Using  the  satellite  connection,  children  go  on  the  Web  and  select  the  text  of  a 
book  in  the  public  domain. 

Mr.  KAHLE:  Download  the  book,  print  it  and  bind  it,  and  they  end  up  with  a  paperback  book 
that's  of  salable  quality.  I  mean,  you  could  have  this  book  in  a  bookstore,  and  it  only  costs  a 
dollar  to  make  each  book. 

SYDELL:  Kahle  says  on  the  Web,  children  can  even  find  books  in  their  own  language.  He  thinks 
Parker's  technology,  combined  with  low-cost  printers  and  Internet  connections,  has  enormous 
potential  to  make  books  more  affordable.  The  technology  does  raise  the  specter  of  the  kind  of 
piracy  that  the  music  and  film  industries  face  but  Kahle  and  other  observers  think  it's  more 
likely  to  be  used  to  provide  access  to  books  that  would  otherwise  be  out  of  print. 

Parker  is  looking  at  ways  to  get  his  technology  to  a  wider  public  in  this  country.  He's  set  up  a 
model  storefront  and  book-making  shop  around  the  corner  from  his  factory. 

Ms.  CAMILLE  SEAMAN:  Hello.  Hello. 

SYDELL:  Camille  Seaman  is  running  the  shop  for  him. 

Ms.  SEAMAN:  We  needed  to  just  jump  in  and  see  what  was  it  going  to  take?  What  do  the  prices 
need  to  be?  What's  the  interest?  How  can  we  fulfill  the  orders  timely  and  make  it  profitable? 

SYDELL:  Seaman  says  it  cost  them  $20,000  to  set  up  the  storefront  just  a  few  weeks  ago,  but 
it  could  be  done  for  less.  Seaman  says  it's  drawn  interest  from  people  who  want  to  bind  a 
book  of  their  own  photos,  artists  who  want  to  promote  their  work  and  writers  unable  to  find 
publishers.  It  costs  consumers  less  than  $40  to  produce  a  40-page  book  on  high-quality 
paper.  But  Parker  doesn't  want  to  open  a  chain  of  these  shops.  He's  hoping  to  show  others  that 
they  can  do  it.  And  while  many  people  have  been  predicting  the  demise  of  the  book  and  the 
rise  of  the  e-book,  Parker  thinks  otherwise. 

Mr.  PARKER:  People  are  tactile,  and  we  have  a  connection  that  goes  back  thousands  of  years, 
as  far  back  as  papyrus,  of  preserving  our  stories  and  history  and  folklore  in  something  we  can 
hold. 

SYDELL:  Of  course,  there's  an  irony  here,  says  Parker.  It's  new  technologies  like  the  Internet 
and  his  binding  machine  that  will  likely  make  it  possible  for  more  people  to  have  their  own 
books.

The Infinite Archive


To  preserve  our  knowledge  base  and  cultures,  we  must  find  a  way  to  save 
digital  content  for  future  generations 

By  Harry  Goldstein 


The first great attempt to permanently preserve all of recorded human
thought - the legendary Library of Alexandria, Egypt - lasted several hundred
years before going up in a giant inferno of crackling papyrus. Now, a couple of
millennia later, the archivist's worst fear is not fire but rather the ravages
inflicted by perpetually changing file formats for documents, audio, and video.

Librarians  and  computer  scientists  from  Cambridge, 
Mass., to, yes, Alexandria, Egypt, are working together to
make the ephemeral permanent. Key projects include
the Massachusetts Institute of Technology's DSpace
digital asset management system, which aims to create an institutional repository that will
include  digitized  versions  of  lecture  notes,  videos,  papers,  and  data  sets — in  short, 
everything  produced  by  faculty  and  staff.  Another  is  the  U.S.  Library  of  Congress'  US 
$100  million  National  Digital  Information  Infrastructure  and  Preservation  Program 
(NDIIPP),  which  is  developing  a  standard  way  for  institutions  to  preserve  their  digital 
archives.  And  in  San  Francisco,  independent  digital  librarian  Brewster  Kahle  is  attempting 
to preserve the content of the Web on his Internet Archive (http://www.archive.org), for
which  he's  enlisted  the  aid  of  the  Bibliotheca  Alexandrina,  in  Alexandria,  which  hosts  one 
copy  of  the  archive  (two  others  are  in  San  Francisco). 

"Storing  bits  for  100  years  is  easier  than  preserving  content  for  10,"  says  Clay  Shirky,  an 
adjunct  professor  at  New  York  University  and  a  consultant  to  the  Library  of  Congress.  "It 
does us no good to store things for 100 years if format drift means our grandchildren can't
read  them." 

There  are  two  approaches  to  digital  archiving:  emulation  and  migration.  With  emulation, 
people  move  architectures  backward  in  time  by  writing  software  that  mimics,  say,  an  Intel 
286  chip.  Using  that  software  lets  you  run  the  old  software  the  chip  ran.  With  migration, 
on  the  other  hand,  archivists  take  documents  created  in  an  obsolete  format  and  simply 
convert  them  to  the  latest  format.  When  another,  newer  format  comes  along,  the 
documents  will  be  converted  again,  ad  infinitum. 
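
The migration approach can be sketched as a chain of converters, each lifting a document one format forward, applied until the document reaches the current format. This is a minimal illustration under stated assumptions: the format names "v1" to "v3", the field names, and the converter functions are all hypothetical and do not correspond to any real file format.

```python
def v1_to_v2(doc):
    """Hypothetical converter: v2 renames the 'body' field to 'text'."""
    return {"format": "v2", "text": doc["body"]}


def v2_to_v3(doc):
    """Hypothetical converter: v3 records the text encoding explicitly."""
    return {"format": "v3", "text": doc["text"], "encoding": "utf-8"}


# One converter per obsolete format; a new entry is added when a format is retired.
CONVERTERS = {"v1": v1_to_v2, "v2": v2_to_v3}
LATEST = "v3"


def migrate(doc):
    """Apply converters one step at a time until the document is in LATEST."""
    while doc["format"] != LATEST:
        doc = CONVERTERS[doc["format"]](doc)
    return doc


old = {"format": "v1", "body": "lecture notes"}
new = migrate(old)
```

When a fourth format arrives, only one new converter is needed, but every stored document has to pass through it, which is the "ad infinitum" cost of migration. Emulation instead leaves the old files untouched and recreates the environment that could read them.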
In  the  long  run,  digital  preservation  cannot  be  done  on  an  ad  hoc  basis.  That's  why  in
2000,  the  United  States  Congress  authorized  the  NDIIPP,  led  by  the  Library  of  Congress 
in  conjunction  with  various  public  institutions  and  private  companies,  to  come  up  with  a 
national  strategy  to  preserve  digital  content  for  future  generations. 

International  cooperation  is  vital  to  the  quest  for  the  infinite  archive,  too.  Just  this  past  fall, 
the  Library  of  Congress  joined  with  the  national  libraries  of  11  other  countries,  including
Australia,  Britain,  Canada,  Finland,  France,  Iceland,  and  Sweden,  to  form  a  consortium  to 
develop  tools  to  capture  and  share  digital  content. 

According  to  Laura  Campbell,  the  library's  associate  librarian  for  strategic  initiatives,  the 
consortium  will  begin  by  developing  specifications  for  crawler  tools  and  "will  work  toward 
a  general  framework  within  which  we  can  all  work  and  ultimately  manage  digital  content." 
Guinness:  Scientist  creates  world's  largest  book

At  133  pounds,  light  reading  it's  not 

CAMBRIDGE,  Massachusetts  (AP)  --  A  133-pound  tome  about  the  Asian  country  of
Bhutan  that  uses  enough  paper  to  cover  a  football  field  and  a  gallon  of  ink  has  been 
declared  the  world's  largest  published  book. 

Author  Michael  Hawley,  a  scientist  at  the  Massachusetts  Institute  of  Technology, 
said  it's  not  a  book  to  curl  up  with  at  bedtime  --  "unless  you  plan  to  sleep  on  it."

Each  copy  of  "Bhutan:  A  Visual  Odyssey  Across  the  Kingdom"  is  5-by-7  feet,  112
pages  and  costs  about  $2,000  to  produce.  Hawley  is  charging  $10,000  to  be  donated 
to  a  charity  he  founded,  Friendly  Planet,  which  has  built  schools  in  Cambodia  and 
Bhutan. 

Guinness  World  Records  has  certified  Hawley's  work  as  the  biggest  published  book, 
according  to  Stuart  Claxton,  a  Guinness  researcher. 

Hawley  has  led  a  number  of  MIT  student  expeditions  to  Cambodia  and  Bhutan,  an 
isolated  country  of  700,000  people  that  is  about  the  size  of  Switzerland,  and  thought 
he  could  raise  money  for  education  there  by  putting  together  some  of  the  thousands 
of  photographs  he  was  gathering. 

He  said  he  did  not  set  out  to  make  the  world's  largest  book.  But  playing  around  in 
his  office  at  MIT's  Media  Lab  with  a  state-of-the-art  digital  printer,  Hawley
discovered  just  how  spectacular  large,  digital  images  can  look  --  especially  of
Bhutan,  a  country  flush  with  colorful  scenery  and  dress  where  even  the  rice  is  red. 

"What  I  really  wanted  was  a  5-by-7-foot  chunk  of  wall  that  would  let  me  change  the 
picture  every  day,"  he  said.  "And  I  thought  there  was  an  old-fashioned  mechanism 
that  might  work.  It's  called  the  book." 

Hawley  said  he's  received  about  two  dozen  orders  for  the  book,  which  includes  an 
easel-like  stand.  Early  customers  include  Brewster  Kahle,  the  inventor  of  the 
Internet  Archive  project,  who  has  known  Hawley  for  years  through  his  computer 
science  work  at  MIT. 

"You  deal  with  a  book  in  a  fundamentally  new  way,"  Kahle  said  when  asked  about 
the  appeal,  adding  he  wasn't  certain  how  he  would  display  his  copy.  "You  meet  it 
eye-to-eye,  like  a  person." 

Processing  and  printing  the  images  took  enormous  chunks  of  computing  power, 
much  of  it  donated  by  companies  including  Dell,  Apple  Computers  and  Kodak. 
Then  there  was  the  assembly.  At  this  size,  the  normal  physics  of  bookbinding 
simply  don't  apply. 

"All  my  traditional  techniques  for  binding  books  are  impossible,"  said  ACME 
Bookbinding  President  Paul  Parisi.  Zeff  Hanower,  a  shop  machinist,  had  to  build  an 
assembly  line  from  scratch.  ACME  also  used  an  "accordion"  style  of  binding  to 
ensure  the  book  folded  and  held  together  properly. 

Hawley  said  his  research  revealed  that  the  biggest  book  in  the  Library  of  Congress 
was  John  J.  Audubon's  19th  century  "Birds  of  America,"  which  is  2.5-by-3.5  feet. 
MP3.com  founder  fights  to  save  the  music


By  Mike  Freeman 

UNION-TRIBUNE  STAFF  WRITER 

November  27,  2003 

The  founder  of  San  Diego  Internet  music  site  MP3.com  is  trying  to  save  the 
company's  database  of  more  than  1  million  songs  from  being  erased  when  the  site 
shuts  down  next  week. 

Michael  Robertson  has  contacted  MP3.com  owner  Vivendi  Universal  in  hopes  of 
persuading  the  company  not  to  delete  the  songs  from  its  computer  servers. 
Vivendi  Universal  has  said  it  will  erase  the  database,  which  mostly  consists  of 
obscure  songs  from  independent  musicians,  after  it  completes  the  sale  of  the 
MP3.com  Internet  domain  and  brand  names  to  CNET  Networks  on  Tuesday. 

Robertson  sold  MP3.com  to  Vivendi  Universal  about  three  years  ago,  and  he  is  no 
longer  part  of  the  company.  Still,  he's  aiming  to  save  the  database  as  a  piece  of 
Internet  history.  CNET  did  not  acquire  the  database  when  it  purchased  the
MP3.com  name. 

Robertson  said  he  sees  the  database  as  an  archive  of  the  early  days  of  digital 
online  music,  dating  all  the  way  back  to,  well,  the  mid-1990s. 

"The  music  efforts  making  news  today  are  standing  on  the  shoulders  of  these  early 
MP3  pioneers,"  Robertson  said  in  a  written  statement.  "A  majority  of  this  music 
cannot  be  found  anywhere  else  in  the  world." 

Robertson  is  lobbying  for  Vivendi  Universal  to  give  permission  to  Archive.org,  a 
not-for-profit  Internet  library  for  digital  material,  to  store  the  songs  on  its  servers. 
"We're  about  to  lose  a  museum  filled  with  digital  antiquities  that  are  every  bit  as
meaningful  as  their  physical  counterparts  filling  today's  museums,"  Robertson 
said. 

Vivendi  Universal  could  not  be  reached  for  comment.  But  Brewster  Kahle,  digital 
librarian  at  Archive.org,  said,  "I  think  Michael  Robertson  is  still  talking  with 
lawyers"  at  the  music  company  over  saving  the  database. 

"If  it  stays  in  the  land  of  lawyers,  I  don't  have  much  hope,"  he  said. 

Robertson's  final  efforts  are  perhaps  a  fitting  requiem  for  MP3.com's  raucous 
six-year  run  in  the  online  music  business. 

In  early  2000,  hoping  to  pry  open  the  Internet  distribution  of  music,  Robertson 
copied  more  than  45,000  commercial  CDs  from  big-name  artists  to  the  company's 
database  without  authorization  from  the  record  companies.  The  move  resulted  in  a 
flurry  of  copyright  infringement  lawsuits.  MP3.com  eventually  paid  $150  million 
to  settle  the  suits,  a  move  that  depleted  the  company's  cash. 

Vivendi  Universal  bought  MP3.com  for  $372  million,  netting  Robertson  an 
estimated  $103  million,  in  2001.  But  the  company  failed  to  make  its  Internet 
music  initiatives  work  and  eventually  sold  its  online  music  businesses,  with  the 
MP3.com  domain  name  being  the  last  to  go. 

The  company's  database  consists  of  songs  mostly  from  little-known  musicians 
who  posted  their  music  in  hopes  of  gaining  exposure  and  fans.  Experts  doubt  that 
the  database  is  worth  much. 

"There's  a  lot  of  stuff,  and  there  may  be  a  few  gems,"  said  Phil  Leigh,  an  analyst 
with  Inside  Digital  Media,  an  industry  research  firm.  "It  would  probably  take 
somebody  like  Michael  Robertson  who  has  the  knowledge  of  what's  in  there  to 
realize  any  value." 

MP3.com  didn't  screen  its  bands.  Any  musician,  no  matter  how  good  or  bad, 
could  upload  songs  to  the  site. 

"Ninety-nine  percent  is  total  crap  that's  up  there,"  said  Martin  Lindhe  of  Bassic, 
one  of  the  most  successful  independent  artists  on  MP3.com,  with  users 
downloading  more  than  9  million  of  Lindhe's  songs.  "That's  the  problem  with  the 
database,  a  lack  of  filtering.  It's  just  open  to  anybody." 
But  because  of  that,  Lindhe  thinks  the  database  has  historical  value  and  should  be
saved.  He's  also  disappointed  that  independent  artists  are  losing  a  well  known 
venue  for  getting  their  music  heard. 

"For  independent  music,  this  is  a  giant  step  backward,"  he  said. 

CNET,  however,  said  yesterday  that  it  plans  to  create  an  Internet  hosting  site  for 
independent  musicians  in  early  2004  patterned  after  MP3.com. 

The  service,  developed  by  CNET's  Download.com,  wasn't  planned  when  the 
company  bought  the  MP3.com  name.  But  it  has  been  added  in  "direct  response  to 
the  feedback  received  from  the  artist  community,"  the  company  said  in  a 
statement. 

CNET  said  independent  musicians  who  want  to  post  songs  on  its  new  service 
should  make  backup  copies  of  their  music  files  and  other  data  hosted  on  MP3.com 
before  Tuesday.  That  way  the  copies  can  be  added  to  the  new  CNET  site  when  it 
launches  next  year. 

CNET  said  directions  on  how  to  make  backup  copies  can  be  found  at 
http://music.download.com. 


Mike  Freeman:  (619)  293-1515;  mike.freeman@uniontrib.com
On  the  Web,  Research  Work
Proves  Ephemeral 

Electronic  Archivists  Are  Playing  Catch-Up  in  Trying  to 
Keep  Documents  From  Landing  in  History's  Dustbin 

By  Rick  Weiss 

Washington  Post  Staff  Writer 

Monday,  November  24,  2003;  Page  A08 

It  was  in  the  mundane  course  of  getting  a  scientific 
paper  published  that  physician  Robert  Dellavalle  came 
to  the  unsettling  realization  that  the  world  was 
dissolving  before  his  eyes. 

The  world,  that  is,  of  footnotes,  references  and  Web 
pages. 

Dellavalle,  a  dermatologist  with  the  Veterans  Affairs 
Medical  Center  in  Denver,  had  co-written  a  research 
report  featuring  dozens  of  footnotes  --  many  of  which
referred  not  to  books  or  journal  articles  but,  as  is 
increasingly  the  case  these  days,  to  Web  sites  that  he 
and  his  colleagues  had  used  to  substantiate  their 
findings. 

Problem  was,  it  took  about  two  years  for  the  article  to 
wind  its  way  to  publication.  And  by  that  time,  many  of 
the  sites  they  had  cited  had  moved  to  other  locations  on 
the  Internet  or  disappeared  altogether,  rendering  useless 
all  those  Web  addresses  --  also  known  as  uniform
resource  locators  (URLs)  --  they  had  provided  in  their
footnotes. 


"Every  time  we  checked,  some  were  gone  and  others  had  moved,"  said  Dellavalle, 
who  is  on  the  faculty  at  the  University  of  Colorado  Health  Sciences  Center.  "We 
thought,  'This  is  an  interesting  phenomenon  itself.  We  should  look  at  this.' " 
He  and  his  co-workers  have  done  just  that,  and  what  they  have  found  is  not
reassuring  to  those  who  value  having  a  permanent  record  of  scientific  progress.  In 
research  described  in  the  journal  Science  last  month,  the  team  looked  at  footnotes 
from  scientific  articles  in  three  major  journals  --  the  New  England  Journal  of
Medicine,  Science  and  Nature  --  at  three  months,  15  months  and  27  months  after
publication.  The  prevalence  of  inactive  Internet  references  grew  during  those 
intervals  from  3.8  percent  to  10  percent  to  13  percent. 

"I  think  of  it  like  the  library  burning  in  Alexandria,"  Dellavalle  said,  referring  to 
the  48  B.C.  sacking  of  the  ancient  world's  greatest  repository  of  knowledge. 
"We've  had  all  these  hundreds  of  years  of  stuff  available  by  interlibrary  loan,  but 
now  things  just  a  few  years  old  are  disappearing  right  under  our  noses  really 
quickly." 

Dellavalle's  concerns  reflect  those  of  a  growing  number  of  scientists  and  scholars 
who  are  nervous  about  their  increasing  reliance  on  a  medium  that  is  proving  far 
more  ephemeral  than  archival.  In  one  recent  study,  one-fifth  of  the  Internet 
addresses  used  in  a  Web-based  high  school  science  curriculum  disappeared  over 
12  months. 

Another  study,  published  in  January,  found  that  40  percent  to  50  percent  of  the 
URLs  referenced  in  articles  in  two  computing  journals  were  inaccessible  within 
four  years. 

"It's  a  huge  problem,"  said  Brewster  Kahle,  digital  librarian  at  the  Internet  Archive 
in  San  Francisco.  "The  average  lifespan  of  a  Web  page  today  is  100  days.  This  is 
no  way  to  run  a  culture." 

Of  course,  even  conventional  footnotes  often  lead  to  dead  ends.  Some  experts 
have  estimated  that  as  many  as  20  percent  to  25  percent  of  all  published  footnotes 
have  typographical  errors,  which  can  lead  people  to  the  wrong  volume  or  issue  of 
a  sought-after  reference,  said  Sheldon  Kotzin,  chief  of  bibliographic  services  at 
the  National  Library  of  Medicine  in  Bethesda. 

But  the  Web's  relentless  morphing  affects  a  lot  more  than  footnotes.  People  are 
increasingly  dependent  on  the  Web  to  get  information  from  companies, 
organizations  and  governments.  Yet,  of  the  2,483  British  government  Web  sites, 
for  example,  25  percent  change  their  URL  each  year,  said  David  Worlock  of 
Electronic  Publishing  Services  Ltd.  in  London. 
That  matters  in  part  because  some  documents  exist  only  as  Web  pages  --  for
example,  the  British  government's  dossier  on  Iraqi  weapons.  "It  only  appeared  on 
the  Web,"  Worlock  said.  "There  is  no  definitive  reference  where  future  historians 
might  find  it." 

Web  sites  become  inaccessible  for  many  reasons.  In  some  cases  individuals  or 
groups  that  launched  them  have  moved  on  and  have  removed  the  material  from 
the  global  network  of  computer  systems  that  makes  up  the  Web.  In  other  cases  the 
sites'  handlers  have  moved  the  material  to  a  different  virtual  address  (the  URL  that 
users  type  in  at  the  top  of  the  browser  page)  without  providing  a  direct  link  from 
the  old  address  to  the  new  one. 

When  computer  users  try  to  access  a  URL  that  has  died  or  moved  to  a  new 
location,  they  typically  get  what  is  called  a  "404  Not  Found"  message,  which 
reads  in  part:  "The  page  cannot  be  displayed.  The  page  you  are  looking  for  is 
currently  unavailable." 
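
As a rough illustration of how a footnote checker might sort references into "active," "moved" and "missing," the sketch below classifies the HTTP status code a server returns. The function names and category labels are assumptions made for this example; they are not taken from the Science study, and a real checker would also have to fetch each URL and cope with timeouts.

```python
from typing import Iterable, Optional


def classify_reference(status: int, location: Optional[str] = None) -> str:
    """Sort one HTTP response into a coarse availability category."""
    if 200 <= status < 300:
        return "active"
    # Redirect codes carry the new address in the Location header.
    if status in (301, 302, 303, 307, 308) and location:
        return "moved"
    # 404 Not Found and 410 Gone both mean the page is not at this address.
    if status in (404, 410):
        return "missing"
    return "unknown"


def inactive_share(categories: Iterable[str]) -> float:
    """Fraction of references that are not directly reachable."""
    items = list(categories)
    if not items:
        return 0.0
    inactive = sum(1 for c in items if c != "active")
    return inactive / len(items)


if __name__ == "__main__":
    results = [
        classify_reference(200),
        classify_reference(301, "http://example.org/new-home"),
        classify_reference(404),
        classify_reference(200),
    ]
    print(results, inactive_share(results))
```

A site that moves its material without leaving a redirect behind shows up here as "missing," which is the situation the article describes.
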

So  common  are  such  occurrences  today,  and  so  iconic  has  that  message  become  in 
the  Internet  era,  that  at  least  one  eclectic  band  has  named  itself  "404  Not  Found," 
and  humorists  have  launched  countless  knockoffs  of  the  page  --  including
www.mamselle.ca/error.html,  which  looks  like  a  standard  error  page  but  scolds
people  for  spending  too  much  time  on  their  computers  ("This  page  cannot  be
displayed  because  you  need  some  fresh  air  .  .  .")  and
www.coxar.pwp.blueyonder.co.uk,  which  offers  political  commentary  about  the
U.S.  war  in  Iraq  ("The  weapons  you  are  looking  for  are  currently  unavailable."). 

Not  all  apparently  inaccessible  Web  sites  are  really  beyond  reach.  Several 
organizations,  including  the  popular  search  engine  Google  and  Kahle's  Internet 
Archive  (www.archive.org),  are  taking  snapshots  of  Web  pages  and  archiving 
them  as  fast  as  they  can  so  they  can  be  viewed  even  after  they  are  pulled  down 
from  their  sites.  The  Internet  Archive  already  contains  more  than  200  terabytes  of 
information  (a  terabyte  is  a  million  million  bytes)  --  equivalent  to  about  200 
million  books.  Every  month  it  is  adding  20  more  terabytes,  equivalent  to  the 
number  of  words  in  the  entire  Library  of  Congress. 
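
The figures in that paragraph imply a simple rule of thumb, worked out below: about one million bytes per book. The months-to-double number is an assumption layered on top of the article's figures, since it supposes the archive keeps growing at a constant 20 terabytes a month.

```python
TERABYTE = 10**12  # "a million million bytes," as the article defines it

archive_bytes = 200 * TERABYTE         # current size reported in the article
books_equivalent = 200 * 10**6         # "about 200 million books"
monthly_growth_bytes = 20 * TERABYTE   # added every month

# Implied size of one "book" in this comparison.
bytes_per_book = archive_bytes // books_equivalent

# Monthly growth expressed in the same book-sized units.
books_added_per_month = monthly_growth_bytes // bytes_per_book

# Assumes constant growth, which the article does not claim.
months_to_double = archive_bytes // monthly_growth_bytes

print(bytes_per_book)         # 1000000
print(books_added_per_month)  # 20000000
print(months_to_double)       # 10
```
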

"We're  trying  to  make  sure  there's  a  good  historical  record  of  at  least  some  subsets 
of  the  Web,  and  at  least  some  record  of  other  parts,"  Kahle  said.  "We're  injecting 
the  past  into  the  present." 

But  with  an  estimated  7  million  new  pages  added  to  the  Web  every  day,  archivists 
can  do  little  more  than  play  catch-up.  So  others  are  creating  new  indexing  and
retrieval  systems  that  can  find  Web  pages  that  have  wandered  to  new  addresses. 

One  such  system,  known  as  DOI  (for  digital  object  identifier),  assigns  a  virtual 
but  permanent  bar  code  of  sorts  to  participating  Web  pages.  Even  if  the  page 
moves  to  a  new  URL  address,  it  can  always  be  found  via  its  unique  DOI. 

Standard  browsers  cannot  by  themselves  find  documents  by  their  DOIs.  For  now, 
at  least,  users  must  use  go-between  "registration  agencies"  --  such  as  one  called
CrossRef  --  and  "handle  servers,"  which  together  work  like  digital  switchboards  to
lead  subscribers  to  the  DOI-labeled  pages  they  seek.  A  hodgepodge  of  other
retrieval  systems  is  cropping  up,  as  well  --  all  part  of  the  increasingly  desperate
effort  to  keep  the  ballooning  Web's  thoughts  accessible. 
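
The "switchboard" idea can be sketched as a lookup table that maps a permanent identifier to whatever address the page currently has. The class below is a toy stand-in written for this example, not CrossRef's or any handle server's actual interface, and the DOI and URLs in it are made up.

```python
from typing import Dict


class HandleRegistry:
    """Toy registry: a permanent identifier points to a changeable URL."""

    def __init__(self) -> None:
        self._locations: Dict[str, str] = {}

    def register(self, doi: str, url: str) -> None:
        """Record where a newly identified page lives."""
        self._locations[doi] = url

    def move(self, doi: str, new_url: str) -> None:
        """Update the address when the page moves; the DOI itself never changes."""
        if doi not in self._locations:
            raise KeyError(f"unknown DOI: {doi}")
        self._locations[doi] = new_url

    def resolve(self, doi: str) -> str:
        """Return the page's current address."""
        return self._locations[doi]


if __name__ == "__main__":
    registry = HandleRegistry()
    # Hypothetical identifier and addresses, for illustration only.
    registry.register("10.9999/example.1", "http://old.example.org/paper.html")
    registry.move("10.9999/example.1", "http://new.example.org/archive/paper.html")
    print(registry.resolve("10.9999/example.1"))
```

A footnote that cites the identifier rather than the URL keeps working after the move, because only the registry entry has to change.
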

If  it  all  sounds  complicated,  it  is.  But  consider  the  stakes:  The  Web  contains 
unfathomably  more  information  than  did  the  Alexandria  library.  If  our  culture 
ends  up  unable  to  retrieve  and  use  that  information,  then  all  that  knowledge  will, 
in  effect,  have  gone  up  in  smoke. 

Research  editor  Margot  Williams  contributed  to  this  report. 

©  2003  The  Washington  Post  Company 